BOSTON UNIVERSITY
COLLEGE OF ENGINEERING
Dissertation
ADAPTIVE MULTI-PLATFORM SEARCH AND
EXPLOITATION
by
DARIN CHESTER HITCHINGS
B.S., University of California, San Diego, 2000
M.S., Boston University, 2002
Submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
2010
Approved by
First Reader
David A. Castañón, Ph.D.
Professor of Electrical and Computer Engineering
Second Reader
John Baillieul, Ph.D.
Professor of Mechanical Engineering
Third Reader
Christos G. Cassandras, Ph.D.
Professor of Electrical and Computer Engineering
Fourth Reader
Prakash Ishwar, Ph.D.
Assistant Professor of Electrical and Computer Engineering
Men have become the tools of their tools.
Henry David Thoreau
Acknowledgments
I have had tons of support from professors, teachers, friends and family over the years
since I undertook the goal of obtaining a Ph.D., quite naively, at age 11. First of all,
I want to heartily thank my adviser, Prof. David Castañón. David is an extremely
gifted applied mathematician and a phenomenally good adviser. I’ve also had many
excellent courses with him. I am incredibly fortunate to have had the chance to work
with David. I also want to thank my committee members Prof. John Baillieul, Prof.
Christos Cassandras, Prof. Prakash Ishwar and Prof. Ajay Joshi for their involvement
and feedback with my dissertation.
I want to thank my father, Todd, for beginning me on the path of mathematics when
I was very young. My dad has a very analytical mind, and he initiated me not only
in word problems and logic, but also in programming. The resources he gave me in
elementary school to work with electronic circuits and program computers have shaped
my career. My beloved mother, Valarie, I thank for all her love and support over the
years. Her appreciation of the fine arts, history and especially languages is precious to
me. She’s been there every step of the way from teaching me how to read and write at
age 5 to copy-editing my dissertation at present. I thank my older brother Sean for
the good times we have shared, the lessons on Calculus and the competition, which has
made me stronger. Making a BASIC program to call me a turkey at Thanksgiving when
I was 6–7 surely motivated my interest in computers! Sean’s knowledge of literature is
the most encyclopedic of anyone’s I will ever know. I thank my aunt Karen and uncle
Jon for the fine examples they have set with respect to their education, their careers,
their awesome adventures and the gentle path they tread through life. Jay, my younger
brother, has taught me much of what I know about people. He saved me from drowning
when we moved to Florida; I owe him everything. He is greatly respected and adored
by all who know him.
Mr. H.T. Payne, Mr. Ted Brecke, Mrs. Marilyn Hoffacker, Mrs. Patricia Franks, Mr.
Dale Russel, Dr. Oshri Karmon, Prof. Anthony Sebald and Mlle. Mireille Chazalviel
have all been my mentors in life and helped set me on this path; my success is theirs.
I’ve worked closely with my friends Karen Jenkins and Rohit Kumar and very much
appreciate their feedback and advice over the years. Thanks to Ye Wang for getting
me going with Linux (in addition to his friendship)! Thanks very much to Sonia Pujol
for the help with my defense and LaTeX. Thanks to George Atia for his friendship and
advice. My thanks to Chris Karl and Shameek Gupta and all my friends inside and
outside the ISS lab (too many to name) who have afforded me an awesome graduate
experience. I love my friends and am very grateful for their camaraderie, support and
discourse. Jessica, my girlfriend, I thank for the support, affection, adventures and for
spoiling me shamelessly with her cuisine à la française.
I would like to thank Prof. Janusz Konrad for the dissertation template and Prof.
Clem Karl for the excellent class. My thanks to both of these professors, Mr. Daniel
Kamalic and Mr. James Goebel for their dedication to IT issues in the ISS lab. I’ve
very much appreciated my conversations with Prof. Selim Ünlü, Prof. Prakash Ishwar,
Prof. Robert Kotiuga, Prof. Josh Semeter, Prof. Franco Cerrina and Mr. Jeff Murphy
concerning how to improve graduate student life on campus in my capacity as president
of the Student Association of Graduate Engineers (SAGE) for 2009–2010. Thanks to
all the SAGE officers, especially Chris Garay and Ye Wang, in addition to Cheryl Kelly
and Helaine Friedlander for what they gave of themselves to this university.
Last of all, I thank the Air Force Office of Scientific Research and the Office of the
Director, Defense Research & Engineering for providing support for this dissertation
under grants FA9550-06-1-0324, FA9550-07-1-0361 and FA9550-07-1-0528.
ADAPTIVE MULTI-PLATFORM SEARCH AND
EXPLOITATION
(Order No. )
DARIN CHESTER HITCHINGS
Boston University, College of Engineering, 2010
Major Professor: David A. Castañón, Ph.D.,
Professor of Electrical and Computer Engineering
ABSTRACT
Recent improvements in the capabilities of autonomous vehicles have motivated their
increased use in such applications as defense, homeland security, environmental moni-
toring and surveillance. To enhance performance in these applications, new algorithms
are required to control teams of robots autonomously and through limited interactions
with human operators. In this dissertation we develop new algorithms for control of
robots performing information-seeking missions in unknown environments. These mis-
sions require robots to control their sensors in order to discover the presence of objects,
keep track of the objects and learn what these objects are, given a fixed sensing budget.
Initially, we investigate control of multiple sensors, with a finite set of sensing options
and finite-valued measurements, to locate and classify objects given a limited budget.
The control problem is formulated as a Partially Observed Markov Decision Problem
(POMDP), but its exact solution requires excessive computation. Under the assumption
that sensor error statistics are independent and time-invariant, we develop a class of
algorithms using Lagrangian Relaxation techniques to obtain optimal mixed strategies
using performance bounds developed in previous research. We investigate alternative
Receding Horizon controllers to convert the mixed strategies to feasible adaptive-sensing
strategies, and evaluate the relative performance of these controllers in simulation. The
resulting controllers provide superior performance to alternative algorithms proposed
in the literature, and obtain solutions to large-scale POMDP problems several orders
of magnitude faster than optimal dynamic programming approaches with comparable
performance quality.
We extend our results for finite action, finite measurement sensor control to scenarios
with moving objects. We use Hidden Markov Models (HMMs) for the evolution of
objects, according to the dynamics of a birth-death process. We develop a new lower
bound on the performance of adaptive controllers in these scenarios, develop algorithms
for computing solutions to this lower bound, and use these algorithms as part of a
Receding Horizon controller for sensor allocation in the presence of moving objects.
We also consider an adaptive-search problem where sensing actions are continuous
and the underlying measurement space is also continuous. We extend our previous
hierarchical decomposition approach based on performance bounds to this problem, and
develop novel implementations of Stochastic Dynamic Programming (SDP) techniques
to solve this problem. Our algorithms are nearly two orders of magnitude faster than
previously proposed approaches, and yield solutions of comparable quality.
For supervisory control, we discuss how human operators can work with and augment
robotic teams performing these tasks. Our focus is on how tasks are partitioned among
teams of robots, and how a human operator can make intelligent decisions for task
partitioning. We explore these questions through the design of a game that involves
robot automata controlled by our algorithms and a human supervisor that partitions
tasks based on different levels of support information. This game can be used with
human subject experiments to explore the effect of information on quality of supervisory
control.
Contents
1 Introduction 1
1.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Dissertation Scope and Contributions . . . . . . . . . . . . . . . . . . . . 4
1.3 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 7
2.1 Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Search Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Information Theory for Adaptive Sensor Management . . . . . . . 9
2.1.3 Multi-armed Bandit Problems . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Stochastic Control . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.5 Human-Robot Interactions and Human Factors . . . . . . . . . . 12
2.1.6 Summary of Background Work . . . . . . . . . . . . . . . . . . . 13
2.2 Sensor Management Formulation and Previous Results . . . . . . . . . . 14
2.2.1 Stationary SM Problem Formulation . . . . . . . . . . . . . . . . 14
2.2.2 Addressing the Search versus Exploitation Trade-off . . . . . . . . 27
2.2.3 Tracing Decision-Trees . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.4 Violation of Stationarity Assumptions . . . . . . . . . . . . . . . 33
2.3 Column Generation And POMDP Subproblem Example . . . . . . . . . 34
3 Receding Horizon Control with Approximate, Mixed Strategies 39
3.1 Receding Horizon Control Algorithm . . . . . . . . . . . . . . . . . . . . 40
3.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Adaptive SM with State Dynamics 55
4.1 Time-varying States Per Location . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Time-varying Visibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Adaptive Sensing with Continuous Action and Measurement Spaces 72
5.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Relaxed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Bayesian Objective Formulation . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6 Human-Robot Semi-Autonomous Systems 96
6.1 Optimizing Human-Robot Team Performance . . . . . . . . . . . . . . . 96
6.1.1 Differences Between Human and Machine World Models . . . . . 97
6.1.2 Human Decision-Making Response Time . . . . . . . . . . . . . . 97
6.1.3 Human and Machine Strengths and Weaknesses . . . . . . . . . . 98
6.1.4 Time-Varying Machine Autonomy . . . . . . . . . . . . . . . . . . 100
6.1.5 Machine Awareness of Human Inputs . . . . . . . . . . . . . . . . 100
6.2 Control Structures for HRI . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Verification of Machine Decisions . . . . . . . . . . . . . . . . . . 103
6.3 Strategy Game Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7 Conclusion 111
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . 113
A Background Theory 116
A.1 Partially Observable Markov Decision Processes . . . . . . . . . . . . . . 116
A.2 Point-Based Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.3 Dantzig-Wolfe Decomposition and Column Generation for LPs . . . . . . 125
B Documentation for column gen Simulator 131
B.1 Build Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.2 Running column gen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
B.3 Outputs of column gen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
B.4 Program Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
B.5 Global variables in column gen . . . . . . . . . . . . . . . . . . . . . . . 143
B.6 Simulator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
References 170
List of Tables
2.1 Example of expanded sensor model for an SEAD mission scenario where
the states are {‘empty’, ‘car’, ‘truck’, ‘SAM’} and the observations are
ys,m = {o1 = ‘see nothing’, o2 = ‘civilian vehicle’, o3 = ‘military vehicle’}
∀ s, m. This setup models a single sensor with modes {u1 = ‘search’,
u2 = ‘mode1’, u3 = ‘mode2’} where mode2 by definition is a higher-
quality mode than mode1. Using mode1, trucks can look like SAMs, but
cars do not look like SAMs. . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Column Generation example with 100 objects. The tableau is displayed in
its final form after convergence. λ^c_s describe the lambda trajectories up until
convergence. R1 and R2 are resource constraints. γ1 is a ‘do-nothing’ strategy.
Bold numbers represent useful solution data. . . . . . . . . . . . . . . . . . 36
3.1 Observation likelihoods for different sensor modes with the observation symbols
o1, o2 and o3. Low-res = ‘mode1’ and High-res = ‘mode2’. . . . . . . . . . . 43
3.2 Decision costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Simulation results for 2 homogeneous, multi-modal sensors in a search and
classify scenario. str1: select the most likely pure strategy for all locations;
str2: randomize the choice of strategy per location according to mixture prob-
abilities; str3: select the strategy that yields the least expected use of resources
for all locations. See Fig. 3·2 - Fig. 3·4 for the graphical version of this table. 46
3.4 Bounds for the simulation results in Table 3.3. When the horizon is short,
the 3 MPC algorithms execute more observations per object than were used
to compute the “bound”, and therefore, in this case, the bounds do not match
the simulations; otherwise, the bounds are good. . . . . . . . . . . . . . . . . 46
3.5 Comparison of lower bounds for 2 homogeneous, bi-modal sensors (left 3 columns)
versus 2 heterogeneous sensors in which S1 has only ‘mode1’ available but S2
supports both ‘mode1’ and ‘mode2’ (right 3 columns). There is 1 visibility-
group with πi(0) = [0.7 0.2 0.1]T ∀ i ∈ [0..99]. For many of the cases studied
there is a performance hit of 10–20%. . . . . . . . . . . . . . . . . . . . . . 49
3.6 Comparison of sensor overlap bounds with 2 homogeneous, bi-modal, sensors
and 3 visibility-groups. Both configurations use the prior πi(0) = [0.7 0.2 0.1]T.
Compare and contrast with the left half of Table 3.5; most of the time the two
sensors have enough objects in view to be able to efficiently use their resources
for both the 60% and 20% overlap configurations; only the bold numbers are
different. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 Simulation results for 3 homogeneous sensors without using detection but with
partial overlap as shown in Fig. 3·5. See Fig. 3·6 - Fig. 3·8 for the graphical
version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.8 Bounds for the simulation results in Table 3.7. When the horizon is short,
the 3 MPC algorithms execute more observations per object than were used to
compute the bound, and therefore, in this case, the bounds do not match the
simulations; otherwise, the bounds are good. . . . . . . . . . . . . . . . . . . 51
5.1 Performance comparison averaged over 100 Monte Carlo simulations. Re-
laxation is the algorithm proposed in this chapter, while Exact is the
algorithm of [Bashan et al., 2008] . . . . . . . . . . . . . . . . . . . . . . 93
List of Figures
2·1 Illustrative example set of prior-probabilities πi(0) using a MATLAB
“stem” plot for case where N = 9 and D = 2. Assuming objects are
classified into 3 types, the Maximum-Likelihood estimate of locations xi
with i ∈ {3, . . . , 6, 8, 9} is type 0 (empty). The ML estimate of xi for
i ∈ {1, 2, 7} is type 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2·2 Hyperplanes from the Value Iteration algorithm that accompany Fig. 2·3. 23
2·3 Policy graphs for the optimal classification of a location with state {‘non-
military’,‘military’}, two possible actions {‘Mode1’,‘Mode2’}, and two
possible observations {‘y1’,‘y2’}. . . . . . . . . . . . . . . . . . . . . . . 23
2·4 This figure is a plot of expected cost (measurement+classification) versus MD
for 3 different resource levels. The solid (blue) line gives the performance when
the resources are pooled into one sensor and the dashed (red) line gives the
performance when the resources are split across two sensors. . . . . . . . . . 25
2·5 Schematic showing how the master problem coordinates the activities of the
POMDP subproblems using Column Generation and Lagrangian Relaxation.
After the master problem generates enough columns to find the optimal values
for the Lagrange multipliers, there is no longer any benefit to violating one
of the resource constraints and the subproblems (with augmented costs) are
decoupled in expectation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2·6 Illustration of “tracing” or “walking” a decision-tree for a POMDP sub-
problem to calculate expected measurement and classification costs (the
individual costs from the total). . . . . . . . . . . . . . . . . . . . . . . 32
2·7 Strategy 1 (mixture weight=0.726). πi(0) = [0.1 0.6 0.2 0.1]’ ∀ i ∈ [0, . . . , 9],
πi(0) = [0.80 0.12 0.06 0.02]T ∀ i ∈ [10, . . . , 99]. The first 10 objects start with
node 5, the remaining 90 start with node 43. The notation [i Ni0 Ni1 Ni2]
indicates the next node/action from node i as a function of observing the 0th,
1st or 2nd observations respectively. . . . . . . . . . . . . . . . . . . . . . . 33
2·8 Strategy 2 (mixture weight=0.274) πi(0) = [0.1 0.6 0.2 0.1]’ ∀ i ∈ [0, . . . , 9],
πi(0) = [0.80 0.12 0.06 0.02]T ∀ i ∈ [10, . . . , 99]. The first 10 objects start with
node 6, the remaining 90 start with node 18. . . . . . . . . . . . . . . . . 34
2·9 The 3 pure strategies that correspond to columns 2, 5 and 6 of Table 2.2. The
frequency of choosing each of these 3 strategies is controlled by the relative
proportion of the mixture weight qc ∈ (0..1) with c ∈ {2, 5, 6}. . . . . . . . . 36
3·1 Illustration of scenario with two partially-overlapping sensors. . . . . . . . . 44
3·2 This figure is the graphical version of Table 3.3 for horizon 3. Simulation results
for two sensors with full visibility and detection (X=’empty’, ’car’, ’truck’, ’mil-
itary’) using πi(0) = [0.1 0.6 0.2 0.1]T ∀ i ∈ [0..9], πi(0) = [0.80 0.12 0.06 0.02]T
∀ i ∈ [10..99]. There is one bar in each sub-graph for each of the three simula-
tion modes studied in this chapter. The theoretical lower bound can be seen
in the upper-right corner of each bar-chart. . . . . . . . . . . . . . . . . . . 47
3·3 This figure is the graphical version of Table 3.3 for horizon 4. . . . . . . . . 47
3·4 This figure is the graphical version of Table 3.3 for horizon 6. . . . . . . . . 48
3·5 The 7 visibility groups for the 3 sensor experiment indicating the number of
locations in each group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3·6 This figure is the graphical version of Table 3.7 for horizon 3. Situation with
no detection but limited visibility (X=’car’, ’truck’, ’military’) using πi(0) =
[0.70 0.20 0.10]T ∀ i ∈ [0..99]. There were 7 visibility-groups: 20x001, 20x010,
20x100, 12x011, 12x101, 12x110, 4x111. The 3 bars in each sub-graph are for
‘str1’, ‘str2’, ‘str3’ respectively. The theoretical lower bound can be seen in the
upper-right corner of each bar-chart. . . . . . . . . . . . . . . . . . . . . . 52
3·7 This figure is the graphical version of Table 3.7 for horizon 4. . . . . . . . . 52
3·8 This figure is the graphical version of Table 3.7 for horizon 6. . . . . . . . . 53
4·1 An example HMM that can be used for each of the N locations. pa is an
arrival probability and pd is a departure probability for the Markov chain. 57
5·1 Depiction of measurement likelihoods for empty and non-empty cells as a func-
tion of xk0. √xk0 gives the mean of the density p(Yk0|Ik = 1). If the cell is
empty the observation is always mean 0 (black curve). . . . . . . . . . . . . 74
5·2 Waterfall plot of joint probability p(Yk0|Ik; xk0) for πk0 = 0.50 for xk0 ∈
[0 . . . 20]. This figure shows the increased discrimination ability that results
from using higher-energy measurements (separation of the peaks). . . . . . . 75
5·3 Graphic showing the posterior probability πk1 as a function of the initial ac-
tion xk0 and the initial measurement value Yk0. This surface plot is for λ = 0.01
and πk0 = 0.20. (The boundary between the high (red) and low (blue) portions
of this surface is not straight but curves towards -y with +x.) . . . . . . . . 76
5·4 Cost function boundary (see Eq. 5.14) with λ = 0.011 and πk0 = 0.18. In the
lighter region two measurements are made, in the darker region just one. (Note
positive y is downwards.) . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5·5 The optimal boundary for taking one action or two as a function of (xk0, Yk0)
(for the [Bashan et al., 2008] cost function) for λ = 0.01 and πk0 = 0.20. The
curves in Fig. 5·6 represent cross-sections through this surface for the 3 x-values
referred to in that figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5·6 This figure gives another depiction of the optimal boundary between taking
one measurement action or two for the [Bashan et al., 2008] cost function. For
all Y (xk0, λ) ≥ 0 two measurements are made (and the highest curve is for the
smallest xk0, see Fig. 5·5 for the 3D surface from which these cross-sections
were taken). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5·7 Two-factor exploration to determine how the optimal boundary between taking
one measurement or two measurements varies for a cell with the parameters
(p, λ) where p = πk0 (for the [Bashan et al., 2008] problem cost function). Two
measurements are taken in the darker region, one measurement for the lighter
region. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5·8 Plot of cost function samples associated with false alarms, missed detections
and the optimal choice between false alarms and missed detections (for the
Bayes’ cost function). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5·9 This figure shows cost-to-go function samples as a function of the second
sensing-action xk1 and the second measurement Yk1 for the Bayes’ cost func-
tion. These plots use 1000 samples for Yk1 and 100 for xk1. . . . . . . . . . . 87
5·10 Threshold function for declaring a cell empty (risk of MD) or occupied (risk of
FA). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5·11 The 0th stage resource allocation as a function of prior probability. The stria-
tions are an artifact of the discretization of resources when looking for optimal
xk0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5·12 Total resource allocation to a cell as a function of prior probability. The point-
wise sums of the 0th stage and 1st stage resource expenditures are displayed
here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5·13 Cost associated with a cell as a function of prior probability. For the optimal
resource allocations, there is a one-to-one correspondence between the cost of
a cell and the resource utilized to sense a cell. . . . . . . . . . . . . . . . . 92
5·14 Cost-to-go from πk1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5·15 Optimal stage 1 energy allocations. . . . . . . . . . . . . . . . . . . . . . . 94
5·16 Stage 0 energy allocation versus prior probability . . . . . . . . . . . . . . . 95
6·1 Graphical User Interface (GUI) concept for semi-autonomous search and
exploitation strategy game. . . . . . . . . . . . . . . . . . . . . . . . . . 106
A·1 Hyperplanes representing the optimal Value Function (cost framework)
for the canonical Wald Problem [Wald, 1945] with horizon 3 (2 sensing
opportunities and a declaration) for the equal missed detection and false
alarm cost case: FA=MD. . . . . . . . . . . . . . . . . . . . . . . . . . . 119
A·2 Decision-tree for the Wald Problem. This figure goes with Fig. A·1. . . . 119
A·3 Example of 3D hyperplanes for a value function (using a reward formu-
lation for visual clarity) for X = {‘military’,‘truck’,‘car’,‘empty’}, S = 1,
M = 3 for a horizon 3 problem. The cost coefficients for the non-military
vehicles were added together to create the 3D plot. This figure and
Fig. A·4 are a mixed-strategy pair. . . . . . . . . . . . . . . . . . . . . . 122
A·4 Example of 3D hyperplanes representing the optimal value function re-
turned by Value Iteration. The optimal value is the convex hull of these
hyperplanes. This figure and Fig. A·3 are a mixed-strategy pair (see
Section 2.3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
B·1 Sequence diagram of the startup process in column gen. . . . . . . . . . 159
B·2 Sequence diagram of how a sensor plan is constructed in column gen. . . 162
B·3 Sequence diagram of the update cycle in column gen. . . . . . . . . . . . 163
B·4 Interface for CSimulator class. . . . . . . . . . . . . . . . . . . . . . . . 166
B·5 Interface for CVehicle class. . . . . . . . . . . . . . . . . . . . . . . . . . 167
B·6 Interface for CGrid and CCell classes. . . . . . . . . . . . . . . . . . . . 168
B·7 Interface for CTask class. . . . . . . . . . . . . . . . . . . . . . . . . . . 169
List of Domain Specific Abbreviations
EO . . . . . . . . . . . . . . . . . . . . . . . . Electro-Optical
FOV . . . . . . . . . . . . . . . . . . . . . . . . Field of View
FTI . . . . . . . . . . . . . . . . . . . . . . . . Fixed Target Indicator
HRI . . . . . . . . . . . . . . . . . . . . . . . . Human-Robot Interactions
IR . . . . . . . . . . . . . . . . . . . . . . . . Infrared
LIDAR . . . . . . . . . . . . . . . . . . . . . . . . Light Detection and Ranging
MTI . . . . . . . . . . . . . . . . . . . . . . . . Moving Target Indicator
SAM . . . . . . . . . . . . . . . . . . . . . . . . Surface To Air Missile
SAR . . . . . . . . . . . . . . . . . . . . . . . . Synthetic Aperture Radar
SM . . . . . . . . . . . . . . . . . . . . . . . . Sensor Management
SNR . . . . . . . . . . . . . . . . . . . . . . . . Signal-to-Noise Ratio
UAV . . . . . . . . . . . . . . . . . . . . . . . . Unmanned Airborne Vehicle
UGV . . . . . . . . . . . . . . . . . . . . . . . . Unmanned Ground Vehicle
USV . . . . . . . . . . . . . . . . . . . . . . . . Unmanned Submersible Vehicle
List of Mathematic Abbreviations
AP . . . . . . . . . . . . . . . . . . . . . . . . Assignment Problem
BB . . . . . . . . . . . . . . . . . . . . . . . . Branch+Bound
CLF . . . . . . . . . . . . . . . . . . . . . . . . Closed-loop Feedback (Control)
DP . . . . . . . . . . . . . . . . . . . . . . . . Dynamic Program(ming)
FA . . . . . . . . . . . . . . . . . . . . . . . . False Alarm (Cost)
HMM . . . . . . . . . . . . . . . . . . . . . . . . Hidden Markov Model
IP . . . . . . . . . . . . . . . . . . . . . . . . Integer Program(ming)
KL . . . . . . . . . . . . . . . . . . . . . . . . Kullback-Leibler (Distance)
LP . . . . . . . . . . . . . . . . . . . . . . . . Linear Program(ming)
MABP . . . . . . . . . . . . . . . . . . . . . . . . Multi-armed Bandit Problem
MAP . . . . . . . . . . . . . . . . . . . . . . . . Maximum A Posteriori
MD . . . . . . . . . . . . . . . . . . . . . . . . Missed Detection (Cost)
MDP . . . . . . . . . . . . . . . . . . . . . . . . Markov Decision Process
MILP . . . . . . . . . . . . . . . . . . . . . . . . Mixed Integer Linear Program(ming)
ML . . . . . . . . . . . . . . . . . . . . . . . . Maximum Likelihood
MPC . . . . . . . . . . . . . . . . . . . . . . . . Model Predictive Control(ler)
MSE . . . . . . . . . . . . . . . . . . . . . . . . Mean-Squared Error
OLFC . . . . . . . . . . . . . . . . . . . . . . . . Open-Loop Feedback Control
PBVI . . . . . . . . . . . . . . . . . . . . . . . . Point-Based Value Iteration
PG . . . . . . . . . . . . . . . . . . . . . . . . Policy-Graph (Decision-Tree)
POMDP . . . . . . . . . . . . . . . . . . . . . . . . Partially Observable Markov Decision Process
PWLC . . . . . . . . . . . . . . . . . . . . . . . . Piece-wise Linear Convex
RH . . . . . . . . . . . . . . . . . . . . . . . . Receding Horizon (Control)
RMS . . . . . . . . . . . . . . . . . . . . . . . . Root Mean Square
SCP . . . . . . . . . . . . . . . . . . . . . . . . Stochastic Control Problem
SDP . . . . . . . . . . . . . . . . . . . . . . . . Stochastic Dynamic Program(ming)
TSP . . . . . . . . . . . . . . . . . . . . . . . . Travelling Salesman Problem
w.l.o.g. . . . . . . . . . . . . . . . . . . . . . . . . without loss of generality
w.r.t. . . . . . . . . . . . . . . . . . . . . . . . . with respect to
Chapter 1
Introduction
1.1 Problem Description
Recent improvements in the hardware capabilities of robotic vehicles have led to increas-
ing use of these devices in such applications as defense, search and rescue, homeland
security, environmental monitoring and video surveillance. To extend and enhance these
applications, new algorithms are required to control the behavior of teams of robots and
to allow human operators to monitor and control them. Including a limited amount of
human input into the decision-making process allows for more robust performance in
accomplishing mission objectives without subjecting humans to unbearable stress and
fatigue. In this dissertation, we address a class of problems that models search and
exploitation missions by one or more semi or fully autonomous vehicles (UAVs, USVs
or UGVs) with heterogeneous sensing capabilities. We focus on how to control vehicle
sensors to discover as many objects as possible within a fixed time-window with limited
human guidance. A “mission” is defined as a set of tasks to be accomplished within a
known, fixed time-frame, the “mission time”, in a fixed-size space, the “mission space”.
There are numerous applications for autonomous search and exploitation techniques.
In a military setting, UAVs can be tasked to explore a hostile area and identify locations
of Surface-to-Air Missile (SAM) sites that threaten piloted aircraft. In a
search and rescue scenario, robotic vehicles can be used to search millions of square miles
of ocean looking for sailors lost at sea. Unmanned vehicles can be used for environmental
monitoring during forest fires or chemical spills. On Mars, rovers are used to explore
an area, studying geological features and searching for signs of water and potentially past
or present life. Other applications include urban law-enforcement and video-surveillance
in airports.
In this dissertation we are interested in making use of imprecise sensors to sense
locations in the mission space about which little is known. We seek to exploit this
noisy data to infer information about the identity of the objects we have sensed and to
determine what to observe next in a near-optimal fashion.
The noisy sensors on sensor platforms (robotic vehicles) are subject to resource
constraints (e.g., the duty-cycle of a radar or the speed with which a camera can be
pointed and focused). Resource constraints can be either cumulative across time or
periodic in nature. In general, we assume that sufficient computational power is available
to process sensor information as quickly as it can be captured, so processing power is
not a significant constraint.
We model the mission-space as a set of N locations where each location may contain
objects of unknown type or be empty. The dynamic state of this problem is the collection
of object types across the mission space at each time. This state is not directly observ-
able, so sensors are used to make noisy observations of these locations. We combine
information over time in a Bayesian framework to estimate the conditional probability
of the state given past observations, which is a sufficient statistic for control (aka the
information state or belief-state). Adaptive SM is then posed as the control problem of
determining the locations to examine with the set of available sensor resources at each
discrete time, as a function of the belief state, in order to discover as much information
as possible about the underlying problem state.
We leverage techniques from stochastic control theory and combinatorial optimiza-
tion theory to develop near-optimal control policies that adapt to information that is
learned throughout a mission. One central theme of this work is to address the question
of the optimal balance between search (exploration) and exploitation: how to charac-
terize the optimal balance between spending time learning new information by searching
new locations for hitherto unknown objects (exploration) versus spending time making
use of the information already available to characterize the identity of objects known
to exist in pre-determined locations. This trade-off is a question of how to optimally
partition the allocation of a scarce resource across multiple tasks that compete for this
resource. This type of optimization problem is challenging because there is a combina-
torial action space, an imperfectly observed state space and because we are interested
in adaptive feedback strategies (dynamic optimization techniques) that keep track of all
possible sensing outcomes and future courses of action.
We assume a centralized control architecture. Thus a central controller coordinates
the actions of each robot actor that participates in the mission, and the controller
has access to all of the information from each of the autonomous vehicles without large
communication bottlenecks. The central controller must 1) compute feasible search plans
and object follow-up actions and 2) communicate these plans to the autonomous vehicles
without undue delays so that each vehicle can execute its part of the plan. Because
robots have heterogeneous capabilities, different roles for the vehicles will emerge from
a control algorithm that tries to optimally manage the resources of individual vehicles.
In the controls community, a large distinction is made between a myopic (aka greedy
or short-sighted) control/planning strategy and a non-myopic (aka far-sighted or opti-
mal) control strategy. While it is possible that a myopic strategy could be optimal, this
is frequently not the case except in special circumstances. Another distinction is made
between an open-loop (non-adaptive) algorithm and a closed-loop (adaptive) algorithm.
An open-loop algorithm follows a sequence of actions, whereas a closed-loop algorithm
is capable of dynamically changing its actions based on the information collected. This
dissertation focuses on algorithms that are both non-myopic and adaptive. While these
algorithms are generally the most computationally demanding, they have the highest
performance.
A real-time control algorithm needs planning information at a rate of 1–100 Hz,
and only so much calculation can be completed before sensing decisions must be made.
Because of the combinatorial nature of the problem, computing every possible place that
sensors can look at every possible future time-instant and considering every possible
action that each sensor can take at each of these places is hopelessly complicated. We
develop algorithms that can rapidly search the decision space to compute desired control
actions in reasonable computation time. We make the assumption that the state of each
location is statistically independent of the states of other locations, which helps us
decompose the Stochastic Control Problem for Sensor Management/Sensor Resource
Allocation (SM) into subproblems of a tractable size.
Most autonomous systems require human guidance and supervision. In order to
explore proper roles for humans and automata in an SM system, we partition all of
the tasks necessary for SM into subsets and consider which tasks should be accom-
plished by a machine and which tasks should be accomplished by humans. We propose
semi-autonomous control algorithms that incorporate human input on a high level and
automated (machine) decision-making on a lower level. We discuss multiple candidate
models for the best way of coordinating between humans and automata, with the goal of
developing a rubric for the most important activities that a human actor/operator can
perform while interacting with a semi-autonomous system.
1.2 Dissertation Scope and Contributions
In this dissertation we develop new algorithms for control of robots performing information-
seeking missions in unknown environments. These algorithms can be used to control
multiple, multi-modal, resource-constrained sensors on autonomous vehicles by specify-
ing where and how they should be used to maximize one of several possible performance
metrics for the mission. The goals of the mission are application-dependent, but at the
very least they will include accurately locating, observing and classifying objects in the
mission space.
The SM control problem can be formulated as a Partially Observed Markov Decision
Problem (POMDP), but its exact solution requires excessive computation. Instead, we
use Lagrangian Relaxation techniques to decompose the original SM problem hierarchi-
cally into POMDP subproblems with coupling constraints and a master problem that
coordinates the prices of resources. To this end, this dissertation makes the following
contributions:
• Develops a Column Generation algorithm that creates mixed strategies for SM and
implements this algorithm in fast, C language code. This algorithm creates sensor plans
that near-optimally solve an approximation of the original SM problem, but generates
mixed strategies that may not be feasible. The output strategies are programmatically
visualized in MATLAB.
• Develops alternatives for receding horizon control using these approximate, mixed strate-
gies output from the Column Generation routine, and evaluates their performance using
a fractional factorial design of experiments based on the above software.
• Extends previous results in SM for classification of stationary objects to allow Markov
dynamics and time varying visibility, obtaining new lower bounds characterizing achiev-
able performance for these problems. These lower bounds can be used to develop receding
horizon control strategies.
• Develops new approaches for solution of dynamic search problems with continuous ac-
tion and observation spaces that are much faster than previous optimal results, with
near-optimal performance. We perform simulations of our algorithms in MATLAB and
compare our results with those of the optimal algorithm from [Bashan et al., 2007,Bashan
et al., 2008]. Our algorithm performs similarly to theirs but can be used for problems
with non-uniform priors.
• Designs a game to explore human supervisory control of automata controlled by our
algorithms, in order to explore the effects of different levels of information support on
the quality of supervisory control.
1.3 Dissertation Organization
The structure of the remainder of this dissertation is as follows: Chapter 2 is devoted
to presenting a literature survey and background material that is pertinent to the SM
problem. Chapter 2 also reviews theory from [Castañón, 2005b] that underlies the
theoretical foundations of this dissertation’s results. Chapter 3 builds upon the results
of Chapter 2 and discusses a RH algorithm for near-optimal SM in a scenario where
objects have static state and sensor platforms are unconstrained in terms of where they
look and when (no motion constraints). Chapter 4 discusses new algorithms for two
extensions to the problem formulation of Chapter 3: 1) objects that can arrive and
depart with a Markov birth-death process or 2) object visibility that is known but time-
varying. Chapter 5 considers an adaptive search problem for sensors with continuous
action and observation spaces and presents fast, near-optimal algorithms for the solution
of these problems. Chapter 6 discusses some candidate strategies for mixed human/non-
human, semi-autonomous robotic search and exploit teams and develops a game that can
be used to explore human supervisory control of robotic teams. Chapter 7 summarizes
this dissertation. Last of all, two appendices are included that provide some additional
background theory and documentation for the simulator discussed in Chapter 3.
Chapter 2
Background
This chapter provides both a literature survey and background material that will be
referred to in later chapters. First we review existing techniques from various fields
related to this dissertation. We describe why the algorithms presented in the litera-
ture fail to address the problem we envision to the extent that is required for a search
and exploitation system to be considered “semi-autonomous”, which we take to mean
an Autonomous Capability Level (ACL) of 6 or higher on the DoD's “UAV Roadmap
2009” [United States Department of Defense, 2009]. Section 2.2 discusses the develop-
ment of a lower bound for the achievable performance of an SM system from [Castañón,
2005b]. The last section in this chapter, Section 2.3, gives an example of our algo-
rithm for computing SM plans. The implementation of the algorithm from [Castañón,
2005b], as demonstrated by this example, is the first contribution of this dissertation.
For the interested reader, a brief review of theory pertaining to Partially Observable
Markov Decision Processes (POMDPs), the Witness Algorithm and Point-Based Value
Iteration (PBVI), Dantzig-Wolfe Decompositions and Column Generation is available
in Appendix A.
2.1 Literature Survey
Problems of search and exploitation have been considered in many fields such as Search
Theory, Information-Theory, Multi-armed Bandit Problems (MABPs), Stochastic Con-
trol, and Human-Robot Interactions. We will review relevant results in each of these
areas in the rest of this section.
2.1.1 Search Theory
One of the earliest examples of Sensor Management (SM) arose in the context of Search,
with application to anti-submarine warfare in the 1940’s [Koopman, 1946, Koopman,
1980]. In this context, Search Theory was used to characterize the optimal allocation
of search effort to look for a single stationary object with a single imperfect sensor.
Sensors had the ability to move spatially and allocate their search effort over time and
space. In [Koopman, 1946] a worst-case search-performance rule is derived yielding the
“Random Search Law” aka the “Exponential Detection Function” [Stone, 1977]. This
work is extended in [Stone, 1975] to handle the case of a single moving object. A survey
of the field of Search Theory is given by [Benkoski et al., 1991], which describes how
most work in this domain focuses on open-loop search plans rather than feedback control
of search trajectories.
The main problem with most Search Theory results is that the search strategies are
non-adaptive and the search ends after the object has been found. Extensions of Search
Theory to problems requiring adaptive feedback strategies have been developed in some
restricted contexts [Castañón, 1995] where a single sensor takes one action at a time.
Recent work on Search has focused on deterministic control of search vehicle tra-
jectories using different performance metrics. Baronov et al. [Baillieul and Baronov,
2010, Baronov and Baillieul, 2010] describe an information acquisition algorithm for the
autonomous exploration of random, continuous fields in the context of environmental
exploration, reconnaissance and surveillance. Our focus in this thesis is on adaptive
sensor scheduling based on noisy observations, and not on control of sensor-platform
trajectories.
2.1.2 Information Theory for Adaptive Sensor Management
Adaptive SM has its roots in the field of statistics, in which Bayesian experiment design
was used to configure subsequent experiments that were based on observed information.
Wald [Wald, 1943,Wald, 1945] considered sequential hypothesis testing with costly ob-
servations. Lindley [Lindley, 1956] and Kiefer [Kiefer, 1959] expanded the concepts
to include variations in potential measurements. Chernoff [Chernoff, 1972] and Fe-
dorov [Fedorov, 1972] used Cramer-Rao bounds for selecting sequences of measurements
for nonlinear regression problems. Most of the strategies proposed for Bayesian experi-
ment design involve single-step optimization criteria, resulting in “greedy” (or “myopic”)
strategies that optimize bounds on the expected performance after the next experiment.
Athans [Athans, 1972] considered a two-point boundary value approach to controlling
the error covariance in linear estimators by choosing the measurement matrices. Other
approaches to adaptive SM using single-stage optimization have been proposed with al-
ternative information theoretic measures [Schmaedeke, 1993,Schmaedeke and Kastella,
1994,Kastella, 1997,Kreucher et al., 2005].
Most of the work on information theory approaches for SM is focused on tracking
objects using linear or nonlinear estimation techniques [Kreucher et al., 2005,Wong et al.,
2005, Grocholsky, 2002, Grocholsky et al., 2003] and uses myopic (single-stage) policies.
Myopic policies generated by entropy-gain criteria perform well in certain scenarios,
but they have no guarantees for optimality in dynamic optimization problems. Along
these lines, the dissertation by Williams provides a set of performance bounds on greedy
algorithms as compared to optimal closed-loop policies in certain situations [Williams,
2007].
2.1.3 Multi-armed Bandit Problems
In the 1970s, Gittins [Gittins, 1979] developed an optimal indexing rule for “Multi-armed Bandit
Problems” (MABP) that is applicable to SM problems. In these approaches, different
objects are modeled as “bandits” and assigning a sensor to look at an object is equivalent
to playing the “bandit”, thereby changing the “bandit’s” state. Krishnamurthy et al.
[Krishnamurthy and Evans, 2001a,Krishnamurthy and Evans, 2001b] and Washburn et
al. [Washburn et al., 2002] use MABP models to obtain SM policies for tracking moving
objects. The MABP model limits their work to policies that use a single sensor with a
single mode, so only one object can be observed at a time.
2.1.4 Stochastic Control
Stochastic control approaches to SM problems are often posed as Stochastic Control
Problems and solved using Dynamic Programming techniques [Bertsekas, 2007]. Evans
and Krishnamurthy [Krishnamurthy and Evans, 2001a] use a Hidden Markov Model
(HMM) to represent object dynamics while planning sensor schedules. Using a Stochas-
tic Dynamic Programming (SDP) approach, optimal policies are found for the cost
functions studied. While the proposed algorithm provides optimal sensor schedules for
multiple sensors, it only deals with one object.
Several authors have recently proposed approximate Stochastic Dynamic Program-
ming techniques for SM based on value function approximations or reinforcement learn-
ing [Wintenby and Krishnamurthy, 2006,Kreucher and Hero, 2006,Chong et al., 2008a,
Washburn et al., 2002,Schneider et al., 2004,Williams et al., 2005,Chong et al., 2008a,
Chong et al., 2008b]. The majority of these results are focused on the problem of track-
ing objects. Furthermore, the proposed approaches are focused on small numbers of
objects, and fail to address the range and scale of the problems of interest in this dis-
sertation. A good overview of approximate DP techniques is available in [Castañón and
Carin, 2008].
Bashan, Reich and Hero [Bashan et al., 2007, Bashan et al., 2008] use DP to solve
a class of two-stage adaptive sensor allocation problems for search with large numbers
of possible cells. The complexity of their algorithm restricts its application to problem
classes where every cell has a uniform prior. Similar results were obtained for an imaging
problem in [Rangarajan et al., 2007]. In this thesis, we develop a different approach that
overcomes this limitation in Ch. 5.
In [Yost and Washburn, 2000], Yost describes a hierarchical algorithm for resource
allocation using a Linear Program (LP) at the top level (the “master problem”) to
coordinate a set of POMDP subproblems in a Battle Damage Assessment (BDA) setting.
This work is similar to the approach of this dissertation, except we are concerned with
more complicated POMDP subproblems.
The problem of unreliable resource allocation is discussed in [Castañón and Wohletz,
2002, Castañón and Wohletz, 2009], in which a pool of M resources is assigned to complete
N failure-prone tasks over several stages using an SDP formulation. Castañón proposes
a receding-horizon control approach to solve a relaxed DP problem that has an execution
time nearly linear in the number of tasks involved; however, this work does not handle a
partially observable state.
Most approaches for dynamic feedback control are limited in application to problems
with a small number of sensor-action choices and simple constraints because the algo-
rithms must enumerate and evaluate the various control actions. In [Castañón, 1997],
combinatorial optimization techniques are integrated into a DP formulation to obtain
approximate SDP algorithms that extend to large numbers of sensor actions. Subsequent
work in [Castañón, 2005b] derives an SDP formulation using partially observed Markov
decision processes (POMDPs) and obtains a computable lower bound to the achievable
performance of feedback strategies for complex multi-sensor, SM problems. The lower
bound is obtained by a convex relaxation of the original combinatorial POMDP using
mixed strategies and averaged constraints. However, the results in [Castañón, 2005b]
do not specify algorithms with performance close to the lower bound (see Section 2.2).
This dissertation describes such an algorithm in Ch. 3 and then proposes theoretical
extensions to this algorithm in Ch. 4.
2.1.5 Human-Robot Interactions and Human Factors
The use of simulation as a technique to explore the best means of Human-Robot Inter-
action (HRI) in teams with multiple robots per human is the subject of [Dudenhoeffer
et al., 2001]. Questions of human situational awareness, mode awareness (what the
robot is currently doing), and mental model formulation are discussed.
In [Raghunathan and Baillieul, 2010], a search game involving the identification
of roots of random polynomials is presented. The paper analyses the search versus
exploration trade-off made by players and develops Markov models that emulate the style
of play of the 18 players involved in the experiments; the models are indistinguishable from
the players w.r.t. a performance metric.
The SAGAT tool for measuring situational awareness has gained wide acceptance in
the literature [Endsley, 1988]. This tool is important for estimating an operator’s ability
to adequately control a team of robots and avoid mishaps.
In the M.S. thesis of [Anderson, 2006], various hierarchical control structures are
described for the human control of multiple robots. A game of tag is played inside
a maze by two teams of three robots controlled at 5 levels of autonomy, and various
metrics for human and robot performance are studied. Robots in this work are either
tele-operated or move myopically, and sensor measurements are noiseless within a certain
range.
In [Scholtz, 2002], possible HRI interactions are divided up into three categories, and
it is speculated that each category of interaction requires different types of information
and a different interface. This work suggests that a system with multiple levels of
autonomy requires different kinds of interfaces according to its mode of operation
and needs to be able to transition between them without confusing the human operator.
In [Cummings et al., 2005], Cummings discusses a list of issues that need to be
addressed to achieve the military’s vision for Network Centric Warfare (NCW). The
author states that to improve system performance, systems must move from a paradigm
of Management by Consent (MBC) to Management by Exception (MBE).
A system for predicting human performance in tasking multiple robotic vehicles is
discussed in [Crandall and Cummings, 2008]. Human behavior is predicted by generating
several stochastic models for 1) the amount of time humans need to issue commands
and 2) the amount of time humans need to switch between tasks. Several performance
metrics are also presented for situational awareness, for the effectiveness of an operator’s
communications with a robot, and for the success of robot behavior while untasked.
These references for HRI focus on human situational awareness, performance metrics,
and various control strategies for human control of automata in simple environments.
These references do not investigate the question of an optimal means of HRI in a semi-
autonomous system in which robots with noisy, resource-constrained sensors are used to
explore an unknown, partially-observable and dynamic environment using non-myopic
and adaptive search strategies.
2.1.6 Summary of Background Work
As the above discussion indicates, the research to date has focused on only parts of
the problem of interest in this dissertation. The methods used in the existing body
of research need to be merged and unified in an intelligent fashion such that a semi-
autonomous search, planning, and execution system is created that behaves cohesively and
in a non-myopic, adaptive fashion.
In this dissertation, we develop and implement algorithms for the efficient compu-
tation of adaptive SM strategies for complex problems involving multiple sensors with
different observation modes and large numbers of potential object locations. The al-
gorithms we present are based on using the lower bound formulation from [Castañón,
2005b] as an objective in a RH optimization problem and on developing techniques for
obtaining feasible sensing actions from (generally infeasible) mixed strategy solutions.
These algorithms support the use of multiple, multi-modal, resource-constrained, noisy
sensors operating in an unknown environment in a search and classification context.
The resulting near-optimal, adaptive algorithms are scalable to large numbers of tasks,
and suitable for real-time SM.
2.2 Sensor Management Formulation and Previous Results
In this section, we discuss the SM stochastic control formulation and results presented
in [Castañón, 2005b], which serve as the starting point for our work in subsequent chapters.
We extend the notation of [Castañón, 2005b] to include multiple sensors and additional
modes such as search.
2.2.1 Stationary SM Problem Formulation
Assume there are a finite number of locations 1, . . . , N, each of which may have an object
with a given type, or which may be empty. Assume that there is a set of S sensors, each
of which has multiple sensor modes, and that each sensor can observe one and only one
location at each time with a selected mode. This assumption can be relaxed, although
it introduces additional complexity in the exposition and the computation.
Let xi ∈ {0, 1, . . . , D} denote the state of location i, where xi = 0 if location i
is unoccupied, and otherwise xi = k > 0 indicates location i has an object of type
k. Let πi(0) ∈ ℜ^{D+1} be a discrete probability distribution over the possible states for
[Figure 2·1 (nine stem plots arranged in a 3×3 grid under the title “Representative beliefs for N locations”, each panel showing Pr(xi) for xi ∈ {0, 1, 2}) appears here.]
Figure 2·1: Illustrative example set of prior-probabilities πi(0) using a
MATLAB “stem” plot for case where N = 9 and D = 2. Assuming
objects are classified into 3 types, the Maximum-Likelihood estimate of
locations xi with i ∈ {3, . . . , 6, 8, 9} is type 0 (empty). The ML estimate
of xi for i ∈ {1, 2, 7} is type 1.
the ith
location for i = 1, . . . , N where D ≥ 2. Assume that the random variables
xi, i = 1, . . . , N, are mutually independent. If independence is not assumed, then it is
possible to learn state information about location i from a measurement of location j
(with i ≠ j). Fig. 2·1 shows a set of probability mass
functions involving N = 9 locations with D = 2 arranged in a 2D grid.
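As a concrete illustration of this belief representation, the following MATLAB sketch builds a set of independent priors πi(0) for N = 9 locations with D = 2 and displays them as stem plots in the style of Fig. 2·1. The probability values are hypothetical and chosen only so that locations 1, 2 and 7 have ML estimate type 1 while the remaining locations are most likely empty.

    % Minimal sketch (hypothetical prior values) of the per-location belief
    % representation behind Fig. 2.1: N independent priors pi_i(0) over {0,...,D}.
    N = 9;  D = 2;
    priors = repmat([0.8; 0.15; 0.05], 1, N);            % most locations likely empty
    priors(:, [1 2 7]) = repmat([0.2; 0.6; 0.2], 1, 3);  % a few locations likely type 1
    figure;
    for i = 1:N
        subplot(3, 3, i);
        stem(0:D, priors(:, i)', 'filled');              % belief over states 0,...,D
        axis([-0.5, D + 0.5, 0, 1]);
        xlabel(sprintf('x_%d', i));  ylabel(sprintf('Pr(x_%d)', i));
    end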
Let there be s = 1, . . . , S sensors, each of which has m = 1, . . . , Ms possible modes
of observation. We assume there is a series of T discrete decision stages where sensors
can select which location to measure, where T is large enough so that all of the sensors
can use their available resources. At each stage, each sensor can choose to employ one
and only one of its modes on a single location to collect a noisy measurement concerning
the state xi at that location. Each sensor s has a limited set of locations that it can
observe, denoted by Os ⊆ {1, . . . , N}. A sensor action by sensor s at stage t is a pair:
us(t) = (is(t), ms(t)) (2.1)
consisting of a location to observe, is(t) ∈ Os, and a mode for that observation, ms(t).
Sensor measurements by sensor s with mode m at stage t, ys,m(t) are modeled as be-
longing to a finite set ys,m(t) ∈ {1, . . . , Ls}. The conditional probability of the measured
value is assumed to depend on the sensor s, sensor mode m, location i and on the true
state at the location, xi, but not on the states of other locations. Denote this condi-
tional probability as P(ys,m(t)|xi, i, s, m). We assume that this conditional probability
given xi is time-invariant, and that the random measurements ys,m(t) are conditionally
independent of other measurements yσ,n(τ) given the states xi, xj, for all sensors s, σ
and modes m, n, provided i ≠ j or τ ≠ t.
Each sensor s has a limited quantity of resources Rs available for measurements over
the T stages. Associated with the use of mode m by sensor s on location i is
a resource cost rs(us(t)), representing power or some other type of resource required to
use this mode on this sensor:

\sum_{t=0}^{T-1} r_s(u_s(t)) \le R_s \quad \forall\, s \in \{1, \ldots, S\}   (2.2)

This is a hard constraint for each realization of observations and decisions.
Let I(t) denote the history of past sensing actions and measurement outcomes up to
and including stage t − 1:
I(t) = {(us(τ), ys,m(τ)), s = 1, . . . , S; τ = 0, . . . , t − 1}
As is frequently the case when working with POMDPs, we make use of the idea of the
information history as a sufficient statistic for the state of the system/world x.
Under the assumption of conditional independence of measurements and indepen-
dence of individual states at each location, the joint probability π(t) = P(x1 = k1, x2 =
k2, . . . , xN = kN |I(t)) can be factored as the product of belief-states (marginal condi-
tional probabilities) for each location. Denote the conditional probability (belief-state)
at location i as πi(t) = p(xi|I(t)). The probability vector π(t) is a sufficient statistic for
all information that is known about the state of the N locations up until time t.
When a sensor measurement is taken, the belief-state π(t) is updated according to
Bayes’ Rule. A measurement of location i with the sensor-mode combination us(t) =
(i, m) at stage t that generates observable ys,m(t) updates the belief-vector as:
\pi_i(t+1) = \frac{\mathrm{diag}\{P(y_{s,m}(t)\,|\,x_i = j, i, s, m)\}\,\pi_i(t)}{\mathbf{1}^T \mathrm{diag}\{P(y_{s,m}(t)\,|\,x_i = j, i, s, m)\}\,\pi_i(t)}   (2.3)
where 1 is the D + 1 dimensional vector of all ones. Eq. 2.3 captures the relevant
information dynamics that SM controls. For generality, the index i in the likelihood
function specifies that the sensor statistics could vary on a location-by-location basis.
Of prime importance is the fact that using π(t) as a sufficient statistic along with
Eq. 2.3, we are able to combine a priori probabilities represented by π(t) with conditional
probabilities given by sensor measurements in order to form posterior probabilities,
π(t + 1), in recursive fashion: beliefs can be maintained and propagated.
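To make the recursion concrete, the following is a minimal sketch in C of the single-location Bayes update of Eq. 2.3; it is not the pomdp-solve code used in this work, and the state dimension, prior, and likelihood column are illustrative assumptions.

#include <stdio.h>

#define NSTATES 3  /* D + 1 possible states at one location (illustrative) */

/* One Bayes update of a single location's belief (Eq. 2.3):
 * post[j] is proportional to P(y | x_i = j) * prior[j], then renormalized. */
static void belief_update(const double *prior, const double *lik_y, double *post)
{
    double norm = 0.0;
    for (int j = 0; j < NSTATES; ++j) {
        post[j] = lik_y[j] * prior[j];
        norm += post[j];
    }
    for (int j = 0; j < NSTATES; ++j)
        post[j] /= norm;            /* assumes P(y) > 0 for the observed y */
}

int main(void)
{
    /* Illustrative prior and likelihood column P(y = o2 | x_i = j) */
    double prior[NSTATES]  = { 0.80, 0.15, 0.05 };
    double lik_o2[NSTATES] = { 0.04, 0.46, 0.46 };
    double post[NSTATES];

    belief_update(prior, lik_o2, post);
    for (int j = 0; j < NSTATES; ++j)
        printf("pi_i[%d] = %.4f\n", j, post[j]);
    return 0;
}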
In addition to information dynamics, there are resource dynamics that characterize
the available resources at stage t. The dynamics for sensor s are given as:
Rs(t + 1) = Rs(t) − rs(us(t)); Rs(0) = Rs (2.4)
These dynamics constrain the admissible decisions by a sensor, in that it can only use
modes that do not use more resources than are available.
An adaptive feedback strategy is a closed-loop policy or decision-making rule that
maps the information collected up until stage t, i.e. the sets Ii(τ) ∀ i ∈ [1, . . . , N], τ ∈
[0, . . . , t − 1], to actions for stage t: γ : I(t) → U(t). Define a local strategy, γi,
as an adaptive feedback strategy that chooses actions for location i based purely on the
information set Ii(t), that is, on the history of past actions and observations specific to
location i.
Given the final information, I(T), the quality of the information collected is measured
by making an estimate of the state of each location i given the available information.
Denote these estimates as vi ∀ i = 1, . . . , N. The Bayes’ cost of selecting estimate vi
when the true state is xi is denoted as c(xi, vi) ∈ ℜ with c(xi, vi) ≥ 0. The objective of
the SM stochastic control formulation is to minimize:
J = \sum_{i=1}^{N} E[c(x_i, v_i)]   (2.5)
by selecting adaptive sensor control policies and final estimates subject to the dynamics
of Eq. 2.3 and the constraints of Eq. 2.2 and Eq. 2.4.
This problem was solved using dynamic programming in [Castañón, 2005a] as follows:
Define the optimal value function V (π, R, t) that is the optimal solution to Eq. 2.5
subject to Eq. 2.2 and Eq. 2.4 when the initial information is π, and R is the vector
of current resource levels. Let R = [R1 R2 . . . RS]^T. Define πij as the jth component
of the probability vector associated with location i. The value function is defined on
S^N × ℜ^S_+ (the product of the N belief simplices and the nonnegative resource orthant).
Let U represent the set of all possible sensor actions and define U(R) ⊂ U
as the set of feasible actions with resource level R. The value function V (π, R, t) must
be recursively related to V (π, R, t + 1), the value with one additional stage to go, according to
Bellman's Equation [Bellman, 1957]:

V(\pi, R, t+1) = \min\Big\{ \sum_{i=1}^{N} \min_{v_i \in X} \sum_{j=0}^{D} c(j, v_i)\,\pi_{ij},\; \min_{u \in U(R)} E_y\{ V(T(\pi, u, y),\, R - R_u,\, t) \} \Big\}   (2.6)
where Ru = [r1(u1(t)) r2(u2(t)) . . . rS(uS(t))]^T and T(·) is an operator describing the
belief dynamics. T is the identity mapping for the information states πj(t) of all locations j
except the sensed locations {j | j = is for some s ∈ [1, . . . , S]}; for a sensed location i, T maps
πi(t) to πi(t + 1) via Eq. 2.3. The expectation in Eq. 2.6 is given by:

E_y\{ V(T(\pi, u, y),\, R - R_u,\, t) \} = \sum_{y \in Y_{s,m}\,\forall\, s} P(y\,|\,I(t), u)\, V(T(\pi, u, y),\, R - R_u,\, t)
where Ys,m is the (discrete) set of possible observations (symbols) for sensor s with
mode m, and y is a vector of measurements with one measurement per sensor. (The
mode m for each sensor s is determined by the vector-action u in the minimization).
The minimization is done over S dimensions because there are S sensors.
To initialize the recursion, the optimal value function when the number of stages to
go t is zero is determined by choosing the classification decision vi without any additional
measurements, as
V(\pi, R, 0) = \sum_{i=1}^{N} \min_{v_i \in X} \sum_{j=0}^{D} c(j, v_i)\,\pi_{ij}   (2.7)
Note that this minimization can be done independently for each location i. The optimal
value of Eq. 2.5 can be computed using Eq. 2.6 - Eq. 2.7.
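For reference, the per-location term of Eq. 2.7 (the best declaration given a final belief) can be computed as in the minimal C sketch below. The cost matrix and belief are illustrative assumptions (in the spirit of Table 3.2 with MD = 5), not the values used in any particular experiment reported here.

#include <float.h>
#include <stdio.h>

#define NSTATES 4   /* D + 1 states (illustrative) */
#define NDECL   4   /* possible declarations v_i (illustrative) */

/* Terminal cost for one location (inner term of Eq. 2.7):
 * min over declarations v of the expected Bayes cost sum_j c(j, v) * pi[j]. */
static double terminal_cost(const double pi[NSTATES],
                            double c[NSTATES][NDECL])
{
    double best = DBL_MAX;
    for (int v = 0; v < NDECL; ++v) {
        double cost = 0.0;
        for (int j = 0; j < NSTATES; ++j)
            cost += c[j][v] * pi[j];
        if (cost < best)
            best = cost;
    }
    return best;
}

int main(void)
{
    /* Illustrative cost matrix: rows are true states, columns are declarations */
    double c[NSTATES][NDECL] = {
        { 0, 1, 1, 1 },   /* true state: empty    */
        { 1, 0, 0, 1 },   /* true state: car      */
        { 1, 0, 0, 1 },   /* true state: truck    */
        { 5, 5, 5, 0 },   /* true state: military (MD = 5) */
    };
    double pi[NSTATES] = { 0.80, 0.12, 0.06, 0.02 };
    printf("terminal cost for this location = %.4f\n", terminal_cost(pi, c));
    return 0;
}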
The problem with the DP equation Eq. 2.6 as it stands is that, whereas the
measurement and classification costs of the N locations initially start off decoupled from
each other (cf. Eq. 2.7), the DP recursion does not preserve this decoupling from one
stage to the next. In general, the best choice of action for location i with t stages to go
will depend on the amount of resources from each sensor that has been expended on other
locations during the previous stages. This leads to a very large POMDP with a combinatorial
number of actions and an underlying belief-state of dimension (D + 1)^N, which is
computationally intractable unless there are few locations.
In [Casta˜n´on, 2005b], the above problem is replaced by a simpler problem that
provides a lower bound on the optimal cost, by expanding the set of admissible strategies,
replacing the constraints of Eq. 2.2 with “soft” constraints:
E\Big[\sum_{t=0}^{T-1} r_s(u_s(t))\Big] \le R_s \quad \forall\, s \in \{1, \ldots, S\}   (2.8)
Note that every admissible strategy that satisfies Eq. 2.2 also satisfies Eq. 2.8. After
relaxing the resource constraints, there is just one constraint per sensor (instead of one
constraint for every possible realization of actions and observations per sensor). These
constraints are constraints on the average resource use one would expect to spend over
the planning horizon.
To solve the relaxed problem, [Castañón, 2005b] proposed incorporating the soft
constraints of Eq. 2.8 into the objective function via Lagrangian Relaxation, with a
Lagrange multiplier λs for each sensor s. The measurement and classification costs
for a pair of locations are then related only through the values of the Lagrange multipliers
associated with the sensors they use in common. Therefore, given the resource prices for
the sensors that the optimal policy uses to measure a pair of locations, the classification
and measurement costs for those two locations are decoupled in expectation. Once we can
partition resources between a pair of locations, we can do so for N locations.
\bar{J}_\lambda = J + \sum_{t=0}^{T-1} \sum_{s=1}^{S} \lambda_s\, E[r_s(u_s(t))] - \sum_{s=1}^{S} \lambda_s R_s   (2.9)
Define an admissible strategy, γ, as a function which maps an information state,
π(t), to a feasible measurement action (or to a null action if sufficient resources are
unavailable). Define Γ as the set of all possible γ. Because the measurements and
possible sensor actions are finite-valued, the set of possible SM strategies Γ is also finite.
Let Q(Γ) denote the set of mixed strategies that assign probability q(γ) to the choice of
strategy γ ∈ Γ.
A key result in [Castañón, 2005b] was that, when the optimization of Eq. 2.9 is
performed over mixed strategies for given values of the Lagrange multipliers λs, the problem
decouples into independent POMDPs, one per location, and the optimization can be
performed using local feedback strategies, γi, for each location i.
Write ΓL for the set of all local feedback strategies. These POMDPs have an underlying
information state-space of dimension D + 1, corresponding to the number of possible
states at a single location, and can be solved efficiently. Decomposition is essential to
make the problem tractable.
Fig. 2·2 and Fig. 2·3 demonstrate what a stereotypical POMDP solution (for a toy
problem) looks like. These figures show the optimal set of solution hyperplanes and the
optimal SM policy for a location, given a vector of resource prices (i.e. assuming for the
moment that we already know the optimal resource prices). The brown and magenta
hyperplanes (nodes 2 and 6 w.r.t. Fig. 2·3) are very nearly parallel to their neighboring
hyperplanes, so two of the three hyperplanes with node ids 1–3 are very nearly redundant
(dominated), and the same goes for node ids 5–7. The smaller the extent of a hyperplane
in the concave (for cost functions) hull of the set of hyperplanes, the smaller its role in the
optimal value function. In this example the cost of the ‘Mode1’ action was 0.1 units and
that of ‘Mode2’ was 0.18 units. If the ‘Mode2’ cost is raised to 0.2 units, then there are
only 7 hyperplanes in the optimal set (i.e. the value function) and ‘Mode2’ is not used at
all. These results are relative to the prior probability and sensor statistics. The alpha
vectors (i.e. hyperplane coefficients) and the actions associated with each hyperplane
(equivalently, each decision-tree node) can be seen in the inset below the value function.
The state enumeration was X = {‘non-military’, ‘military’}. The alpha vector coefficients
give the classification + measurement cost of a location having each one of these states
w.r.t. this enumeration.
In Fig. 2·3, assume hypothesis ‘H2’ corresponds to ‘Declare military vehicle’ and
‘H1’ is the null hypothesis (‘Declare non-military vehicle’). In this policy the arrows
leaving each node on top represent observation ‘y1’ (‘non-military’), and the arrows on
bottom represent ‘y2’ (‘military’). The 9 nodes on the left of this policy correspond
to the 9 hyperplanes that make up Fig. 2·2. If there had been more than two possible
actions and two possible observations, then after a few stages there could easily have
been thousands of distinct nodes in the initial stage! This figure uses a model with a
dummy/terminal capture state, so it is possible to stop sensing at any time.
The use of two states, one sensor with two modes, two observations (the same type of
observations for both modes) for a horizon 5 (4 sensing actions+classification) POMDP
results in 9 hyperplanes based on the particular cost structure used: ‘Mode1’ costs
0.1 units and ‘Mode2’ costs 0.18 units. False alarms (FAs) and Missed detections (MDs)
each cost 1 unit. For problems with several sensors, 4 possible states and 3 modes
per sensor with 3 possible observations, there are frequently on the order of 500–1000
hyperplanes for a horizon 5 POMDP. This work was originally done using the Witness
Algorithm in a modified version of pomdp-solve-5.3 [Cassandra, 1999], but that algorithm
is slow when solving thousands (and later millions) of POMDPs in a loop. Therefore,
by default we use the Finite Grid algorithm (PBVI) within our customized version of
pomdp-solve-5.3 with 500–1000 belief-points. This allows a POMDP of this size to be
solved in about 0.2 sec on a single-core, 2.2 GHz Intel P4 Linux machine.
There is a trade-off between correctly detecting objects and engendering false alarms.
Fig. 2·4 illustrates how the overall classification cost increases as the MD:FA cost ratio
increases (from 1:1 through 80:1) for 3 resource levels {300, 500, 700} in
Figure 2·2: Hyperplanes from the Value Iteration algorithm that accom-
pany Fig. 2·3.
Figure 2·3: Policy graphs for the optimal classification of a location with
state {‘non-military’,‘military’}, two possible actions {‘Mode1’,‘Mode2’},
and two possible observations {‘y1’,‘y2’}.
two cases: in the first case the resources are all available to a single sensor that supports
two modes of operation {‘mode1’, ‘mode2’}, and in the second case the resources are
equally divided between two identical sensors that each support the same two modes of
operation. Partitioning the resources in this way adds an additional constraint that
increases the classification cost. We also observe that the larger the quantity of available
resources, the larger the discrepancy between the pooled single-sensor case and the split
two-sensor case.
In order to have meaningful POMDP solutions, we must have a way of coordinating
the sensing activities between various locations. Lagrange multipliers and Lagrangian
Relaxation provide this coordinating mechanism. Writing our policies for SM in terms
of mixed strategies allows linear programming techniques to be used for Lagrangian
Relaxation. To this end, we write Eq. 2.9 in terms of mixed strategies:
\tilde{J}^*_\lambda = \min_{\gamma \in Q(\Gamma_L)} E_\gamma\Big[ \sum_{i=1}^{N} c(x_i, v_i) + \sum_{t=0}^{T-1} \sum_{s=1}^{S} \lambda_s\, r_s(u_s(t)) \Big] - \sum_{s=1}^{S} \lambda_s R_s   (2.10)
where the strategy γ maps the current information state π(t) to the choice of us(t) ∀ s.
At stage T the strategy γ also determines the classification decisions vi ∀ i. Because we
replaced the hard resource constraints of Eq. 2.2 with the relaxed constraints of Eq. 2.8,
thereby expanding the space of feasible strategies, the optimal cost of the original problem
is lower bounded using Eq. 2.10. This identification leads to the inequality:

J^* \ge \sup_{\lambda_1, \ldots, \lambda_S \ge 0} \tilde{J}^*_{\lambda_1, \ldots, \lambda_S}   (2.11)
As shown in [Castañón, 2005a], Eq. 2.11 is the dual of the LP:

\min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, E_\gamma[J(\gamma)]   (2.12)
Figure 2·4: Expected cost (measurement + classification) versus the MD:FA
cost ratio for 3 different resource levels. The solid (blue) line gives the performance
when the resources are pooled into one sensor, and the dashed (red) line gives the
performance when the resources are split across two sensors.
subject to:

\sum_{\gamma \in \Gamma_L} q(\gamma)\, E_\gamma\Big[ \sum_{i=1}^{N} \sum_{t=0}^{T-1} r_s(u_s(t)) \Big] \le R_s \quad \forall\, s \in \{1, \ldots, S\}   (2.13)

\sum_{\gamma \in \Gamma_L} q(\gamma) = 1   (2.14)
where we have one constraint for each of the S sensor resource pools and an additional
simplex constraint in Eq. 2.14 which ensures that q ∈ Q(ΓL) forms a valid probability
distribution.
This is a large LP, with one variable per strategy in ΓL. However, the total number
of constraints is S + 1, which establishes that optimal solutions of this LP are mixtures
of no more than S + 1 pure strategies. Thus, one can use a Column Generation approach
[Gilmore and Gomory, 1961, Dantzig and Wolfe, 1961, Yost and Washburn, 2000] to
quickly identify an optimal mixed strategy that solves the relaxed
(i.e. approximate) form of our SM problem. (See Appendix A.3 for an overview of
Column Generation). To use Column Generation with the LP formulation Eq. 2.12 -
Eq. 2.14, we break the original problem hierarchically into two new sets of problems
that are called the master problem and subproblems. There is one POMDP subproblem
for each location. The master problem consists of identifying the appropriate values of
the Lagrange multipliers, λs ∀ s, to determine how resources should be shared across
locations, and the subproblems consist of using these Lagrange multipliers to compute
the expected resource usage and expected classification cost for each of the N locations.
See Fig. 2·5 for a pictorial representation.
Column Generation works by solving Eq. 2.12 and Eq. 2.13, restricting the mixed
strategies to be mixtures of a small subset Γ′
L ⊂ ΓL. The solution of the restricted LP
has optimal dual prices λs, s = 1, . . . , S. Using these prices, one can determine a corre-
sponding optimal pure strategy by minimizing Eq. 2.9, which the results in [Castañón,
2005b] show can be decoupled into N independent optimization problems, one for each
location. Each of the subproblems is solved as a POMDP using standard algorithms,
Figure 2·5: Schematic showing how the master problem coordinates the ac-
tivities of the POMDP subproblems using Column Generation and Lagrangian
Relaxation. After the master problem generates enough columns to find the
optimal values for the Lagrange multipliers, there is no longer any benefit to
violating one of the resource constraints and the subproblems (with augmented
costs) are decoupled in expectation.
such as Point-Based Value Iteration (PBVI; see Appendix A.2) [Pineau et al., 2003], to
determine the best pure strategy γ1 for these prices. Solving all of the subproblems allows a
new column to be generated by providing values for the expected classification cost and
expected resource utilization for a given set of sensor prices λs; these values become the
coefficients in the new column in the (Revised) Simplex Tableau of the master problem.
The column that is generated will be a pure strategy that is not already in the basis of
the LP (or else the master problem would have converged). If the best pure strategy, γ1,
for the prices, λs ∀ s ∈ [1, . . . , S], is already in the set Γ′
L, then the solution of Eq. 2.12
and Eq. 2.13 restricted to Q(Γ′
L) is an optimal mixed strategy over all of Q(ΓL), and
the Column Generation algorithm terminates. Otherwise, the strategy γ1 is added to
the admissible set Γ′
L, and the iteration is repeated. The solution to this algorithm is
a set of mixed strategies that achieve a performance level that is a lower bound on the
original SM optimization problem with hard constraints.
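The iteration structure can be summarized with the C skeleton below. It is only a sketch of the control flow: the restricted-master LP and the per-location POMDP solves are replaced here by toy placeholders (a subgradient-style price update and a smooth synthetic subproblem) so that the skeleton compiles and terminates. The actual algorithm solves a small LP to obtain both the mixture weights and the dual prices, and solves one POMDP per location with PBVI to generate each column; none of the numbers below correspond to a real scenario.

#include <math.h>
#include <stdio.h>

#define S_SENSORS 2
#define MAX_COLS  32

/* A column of the master LP: expected classification cost and expected
 * resource use per sensor, summed over all locations, for one pure strategy. */
typedef struct {
    double cost;
    double use[S_SENSORS];
} Column;

/* Placeholder for the subproblem solves: the real algorithm solves one POMDP
 * per location (PBVI) at the given prices and sums the per-location results.
 * Here a smooth synthetic function of the prices is returned instead. */
static Column solve_subproblems(const double lambda[S_SENSORS])
{
    Column c;
    c.cost = 0.0;
    for (int s = 0; s < S_SENSORS; ++s) {
        c.use[s] = 100.0 / (1.0 + lambda[s]);             /* fake expected usage */
        c.cost  += 10.0 * lambda[s] / (1.0 + lambda[s]);  /* fake class. cost    */
    }
    return c;
}

/* Placeholder for the restricted master problem: the real algorithm re-solves
 * a small LP over the current columns and returns new dual prices; here a
 * crude subgradient-style step nudges the prices toward the budgets. */
static void update_prices(const Column *last, const double budget[S_SENSORS],
                          double lambda[S_SENSORS])
{
    for (int s = 0; s < S_SENSORS; ++s) {
        lambda[s] += 0.01 * (last->use[s] - budget[s]);
        if (lambda[s] < 0.0)
            lambda[s] = 0.0;
    }
}

int main(void)
{
    double budget[S_SENSORS] = { 100.0, 100.0 };
    double lambda[S_SENSORS] = { 0.1, 0.1 };
    Column cols[MAX_COLS];
    int ncols = 0;

    while (ncols < MAX_COLS) {
        Column c = solve_subproblems(lambda);          /* generate a new column     */
        if (ncols > 0 && fabs(c.cost - cols[ncols - 1].cost) < 1e-9)
            break;                                     /* column repeats: converged */
        cols[ncols++] = c;
        update_prices(&c, budget, lambda);             /* re-solve the master       */
    }
    printf("generated %d columns, final prices %.4f %.4f\n",
           ncols, lambda[0], lambda[1]);
    return 0;
}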
2.2.2 Addressing the Search versus Exploitation Trade-off
As one contribution of this dissertation, we address how to non-myopically trade off
between spending time searching for objects versus spending time acting on them. This is
an easy generalization to make: object confusion matrices can be used to allow inference
based on detection alone.
         P(y1,1(t)|xi, u1(t))      P(y1,2(t)|xi, u2(t))      P(y1,3(t)|xi, u3(t))
          o1    o2    o3            o1    o2    o3            o1    o2    o3
empty    0.92  0.04  0.04          0.95  0.03  0.02          0.97  0.02  0.01
car      0.08  0.46  0.46          0.05  0.85  0.10          0.02  0.95  0.03
truck    0.08  0.46  0.46          0.05  0.10  0.85          0.02  0.90  0.08
SAM      0.08  0.46  0.46          0.05  0.10  0.85          0.02  0.03  0.95
Table 2.1: Example of expanded sensor model for an SEAD mission
scenario where the states are {‘empty’, ‘car’, ‘truck’, ‘SAM’} and the
observations are ys,m = {o1 = ‘see nothing’, o2 = ‘civilian vehicle’, o3
= ‘military vehicle’} ∀ s, m. This setup models a single sensor with
modes {u1 = ‘search’, u2 = ‘mode1’, u3 = ‘mode2’} where mode2 by
definition is a higher-quality mode than mode1. Using mode1, trucks can
look like SAMs, but cars do not look like SAMs.
To accomplish this, we augment the sensor model used in the example of [Castañón, 2005a]
with a ‘search’ action that supports a low-resolution mode of operation (with low resource
demand) designed for object detection but incapable of object classification. A sensor
observation model can be made non-informative w.r.t. object type by setting the conditional
probability of each observation to be the same for every object type, as in the ‘search’
columns of Table 2.1. As a simple starting point for the new sensor model, we consider
three possible observation values, {o1 = ‘see nothing’, o2 = ‘uninteresting object’,
o3 = ‘interesting object’}, that have known statistics and are the result of pre-processing
and thresholding sensor data. The ‘search’ action effectively returns the probability
P(o2 ∪ o3 | xi, u1) of detecting something.
2.2.3 Tracing Decision-Trees
One hindrance is that the hyperplanes given by a POMDP solver that represent the
expected cost-to-go are in terms of total cost. In order to create a new column, it is
necessary to separate out the classification cost from the measurement costs. This pro-
cess is best illustrated with an example. Consider Fig. 2·6. This figure is an illustration
of “tracing” or “walking” a POMDP decision-tree solution to calculate expected clas-
sification costs and resource utilizations for a subproblem. The states in this example
are indexed as {‘military’, ‘truck’, ‘car’, ‘empty’}. The dot-product of the probability
vector for the current information state, in this example π(0) = [0.02 0.06 0.12 0.80]^T,
with the best hyperplane returned by Value Iteration (or the approximation PBVI)
gives the total cost for location i: Ji,total = Ji,measure +Ji,classify, from which we can calcu-
late the subproblem classification cost, Ji,classify, once we subtract out its measurement
cost, Ji,measure. To subtract out the measurement cost, we must recursively traverse the
decision-tree and sum up the (expected) cost of each potential measurement action. The
probability mixture weights in the expected cost of each action are given by the obser-
vation probabilities P(ys,m(t)|πi(t), u(t)), where u(t) is the sensor action taken. For the
particular initial probability of π(0), only actions in the set {‘wait’, ‘search’} are part
of the optimal solution (for simplicity of illustration). The numbers in blue represent
the conditional likelihood of an observation occurring, and the color of each node rep-
resents the optimal choice of action for that information state (and nearby information
states): {white=‘wait’, aqua=‘search’}. Given this decision-tree that represents the op-
timal course of action for the information state π(0), the set of possible future beliefs
and the relative likelihood of each belief occurring are shown. The possible beliefs and
likelihoods display their respective observation histories up to time t using the conven-
tion o = {y(0), . . . , y(t−1)}. In this example false alarm and missed detection costs are
equal (FA=MD), and the (time-invariant) likelihoods for the ‘search’ action are:
P(y_{search}(t)\,|\,x_i, \text{‘search’}) =
\begin{bmatrix}
0.08 & 0.46 & 0.46 \\
0.08 & 0.46 & 0.46 \\
0.08 & 0.46 & 0.46 \\
0.92 & 0.04 & 0.04
\end{bmatrix}
In this matrix, the states xi vary along the rows, and the observations {‘o0’, ‘o1’, ‘o2’} (for
{‘see nothing’, ‘see non-military vehicle’, ‘see military vehicle’}) vary across the columns.
All indices are 0-based. There were 3 observations (which in general implies three child
nodes for every node in the decision-tree) for some actions, but the search action has a
uniform observation probability over all non-empty observations (all observations except
‘o0’), and therefore the latter two future node indices for search nodes (nodes that specify
search actions) are always the same. (This keeps the example tractable). For a ‘wait’
action, all three future nodes are the same because there is only one possible future belief-
state. The green terminal classification (‘declaration’) node represents the decision that
a location contains a benign object (‘truck’, ‘car’), and the gray declaration node the
decision that the location is ‘empty’. The nodes are labeled using the scheme ‘[nodeId
nextNodeId[o0] nextNodeId[o1] nextNodeId[o2]]’, so for the root node (stage 0, node 0) the
next node will be (stage 1, node 4) if the observation is o0, (stage 1, node 1) if the
observation is o1, and again (stage 1, node 1) for observation o2, because a search action
cannot discriminate object type.
characters. Notice there are two possible information states (beliefs) for π(2) at nodeId 0
during the second-to-last stage, and therefore the conditional observation probabilities
at this node are path-dependent (the nodes represent a convex region of belief-space,
they do not represent a unique probability vector). The red star and black box in the
figure indicate the two different possible beliefs (and therefore the two different possible
sets of observation likelihoods) for this node. The vector πnew(1) represents the future
belief-state after one time interval if no action is taken. In other words the state is
non-stationary in this example; in fact an HMM was imposed on the state of each location.
The HMM has an arrival probability (the chance of leaving the ‘empty’ state) of 5% at
each stage: the probability of a location being empty goes from 80% to 76% in one stage
(0.80 × 0.95 = 0.76), and this probability mass diffuses elsewhere (increasing the chance
of state ‘military’).
One caveat w.r.t. the software package we used (an extensively modified version of
Tony Cassandra’s pomdp-solve-5.3 [Cassandra, 1999]) that comes into play with a time-
varying state is that an observation at stage t (undesirably) refers to the system state
at time t+1. This is the convention used in the robotics community but is not desirable
in an SM context: it’s anti-causal. There is no difference if the state is stationary. We
will pick up this topic of non-stationarity again in the next section and then as one of
the main topics of Ch. 4.
Each possible terminal belief-state is indicated along with the associated proba-
bility of classification error in the lower-right corner. By way of example, while the
P(error|π(3;o=0,0,0)) = 0.0052, the expected contribution of this error to the terminal
classification cost is even smaller because the likelihood of this particular outcome is
the joint probability of the associated observations: P(y0 = 0, y1 = 0, y2 = 0) = P(y0 =
0)P(y1 = 0)P(y2 = 0) = 0.7184 × 0.8567 × 0.8724 ≈ 0.537 (the numbers in blue along this realiza-
tion of the decision-tree). The variables in the conditioning were suppressed for brevity.
Unfortunately, walking the decision-trees to back out classification costs is rather slow
(recursive function calls) with large trees, requiring on the order of 15% of the computational
time in simulations with horizon 6 plans (PBVI took around 80%). However, this operation
is parallelizable, and so is the PBVI algorithm.
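The traversal itself is a straightforward recursion, as in the minimal C sketch below. It is written under simplified assumptions: the node layout is hypothetical (not pomdp-solve's internal representation), and the observation probabilities are stored per node, whereas in the actual decision-trees they are recomputed from the (possibly path-dependent) belief at each node.

#include <stdio.h>

#define NOBS 3   /* observation symbols o0, o1, o2 (illustrative) */

/* A decision-tree node: the resource cost of the action it prescribes and an
 * index per observation symbol to the next node (-1 at a declaration node).
 * This layout is illustrative only. */
typedef struct {
    double action_cost;
    int    next[NOBS];
} Node;

/* Recursively accumulate the expected measurement cost of the subtree rooted
 * at node `id`, weighting each branch by the probability of the observation
 * leading to it. Classification cost is then total cost minus this value. */
static double expected_measure_cost(const Node *tree, int id,
                                    double obs_prob[][NOBS])
{
    if (id < 0)
        return 0.0;                           /* declaration node: no cost */
    double cost = tree[id].action_cost;
    for (int o = 0; o < NOBS; ++o)
        cost += obs_prob[id][o] *
                expected_measure_cost(tree, tree[id].next[o], obs_prob);
    return cost;
}

int main(void)
{
    /* Two-stage toy tree: node 0 searches (cost 0.25), then either declares
     * (on o0) or searches once more (on o1, o2) before declaring. */
    Node tree[2] = {
        { 0.25, { -1,  1,  1 } },
        { 0.25, { -1, -1, -1 } },
    };
    double obs_prob[2][NOBS] = {
        { 0.75, 0.125, 0.125 },   /* P(y | belief at node 0), illustrative */
        { 0.20, 0.400, 0.400 },
    };
    printf("E[measurement cost] = %.4f\n",
           expected_measure_cost(tree, 0, obs_prob));
    return 0;
}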
As a slightly more complex example of a set of POMDP solutions and what tracing
decision-trees entails, consider Fig. 2·7 and Fig. 2·8 which show a pair of decision-trees
for a horizon 6 scenario with D = 3 and Ms = 3 (plus a ‘wait’ action). The state
‘empty’ has been added to X and a ‘search’ mode has been added to the action space.
The ‘search’ mode is able to quickly detect the presence or absence of an object but
completely unable to specify object type. In addition, an HMM has been used instead
of having the state be stationary such that the model allows for a non-zero probability
of object arrivals from one stage to the next. This example uses an object arrival
probability of 5% per stage. It is interesting to note the situations in which the optimal
strategy is to wait to act versus to gain as much information as possible with the time
available.
Figure 2·6: Illustration of “tracing” or “walking” a decision-tree for a
POMDP subproblem to calculate expected measurement and classification
costs (the individual costs from the total).
Figure 2·7: Strategy 1 (mixture weight = 0.726). πi(0) = [0.1 0.6 0.2 0.1]^T
∀ i ∈ [0, . . . , 9], πi(0) = [0.80 0.12 0.06 0.02]^T
∀ i ∈ [10, . . . , 99]. The first 10
objects start with node 5, the remaining 90 start with node 43. The notation
[i Ni0 Ni1 Ni2] indicates the next node/action from node i as a function of
observing the 0th, 1st or 2nd observations respectively.
2.2.4 Violation of Stationarity Assumptions
At first glance, using a HMM state model with arrival probabilities, as in Fig. 2·6,
seems to violate our stationarity assumptions that allowed us to decompose the original
problem into one in which we have parallel virtual time-lines happening at each location
and where we did not need to worry about the sequencing of events between these
locations. Given stationarity, the order in which locations are sensed does not matter.
We notice that the same thing is true for a HMM with arrival probabilities because
having an object arrive at a location does not influence the optimal choice of sensing
Figure 2·8: Strategy 2 (mixture weight = 0.274). πi(0) = [0.1 0.6 0.2 0.1]^T ∀ i ∈
[0, . . . , 9], πi(0) = [0.80 0.12 0.06 0.02]^T
∀ i ∈ [10, . . . , 99]. The first 10 objects
start with node 6, the remaining 90 start with node 18.
actions at that location in the past when the location was empty. If a location is
empty, there is no sensing to be done. An arrival only affects the best choices of sensing
action for that location in the future, and we replan every round. Therefore we can
still decouple sensing actions across locations when we have arrivals. The problem in
developing a model for a time-varying state is how to handle object departures. If an
object departs (a location becomes empty), then the best choice of previous actions for
that location is affected retroactively.
2.3 Column Generation And POMDP Subproblem Example
In this section we present an example of the Column Generation algorithm and POMDP
algorithms discussed previously. In this simple example we consider 100 objects (N=100),
2 possible object types (D=2) with X = {‘non-military vehicle’, ‘military vehicle’}, and
2 sensors that each have one mode (S = 2 and Ms = 1 ∀ s ∈ {1, 2}). Sensor s actions
have resource costs: rs, where r1 = 1, r2 = 2. Sensors return 2 possible observation
values, corresponding to binary object classifications, with likelihoods:
P(y_{1,1}(t)\,|\,x_i, u_1(t)) = \begin{bmatrix} 0.90 & 0.10 \\ 0.10 & 0.90 \end{bmatrix}, \qquad
P(y_{2,1}(t)\,|\,x_i, u_2(t)) = \begin{bmatrix} 0.92 & 0.08 \\ 0.08 & 0.92 \end{bmatrix}
where the (j, k) matrix entry denotes the likelihood that y = k if xi = j. The second
sensor has 2% better performance than the first sensor but requires twice as many
resources to use. Each sensor has Rs = 100 units of resources, and can view every
location. Each of the 100 locations has a uniform prior of πi = [0.5 0.5]^T ∀ i. For the
performance objective, we use c(xi, vi) = 1 if xi ≠ vi, and 0 otherwise, so a classification
error costs 1 unit.
Table 2.2 demonstrates the Column Generation solution process. The first three
columns are initialized by guessing values of resource prices and obtaining the POMDP
solutions, yielding expected costs and expected resource use for each sensor at those
resource prices. A small LP is solved to obtain the optimal mixture of the first three
strategies γ1, . . . , γ3, and a corresponding set of dual prices. These dual prices are used
in the POMDP solver to generate the fourth column γ4, which yields a strategy that is
different from that of the first 3 columns. The LP is re-solved for mixtures of the first
4 strategies, yielding new resource prices that are used to generate the next column. This
process continues until the solution using the prices after 7 columns yields a strategy
that was already represented in a previous column, terminating the algorithm. The
optimal mixture combines the strategies of the second, fifth and sixth columns. When
the master problem converges, the optimal cost, J^*, for the mixed strategy is 5.95 units.
The resulting decision-trees are illustrated in Fig. 2·9, where upward branches indicate
measurements y = 1 (‘non-military’) and downward branches y = 2 (‘military’). The red
and green nodes denote the final decision, vi, for a location.
                    γ1       γ2      γ3      γ4      γ5      γ6      γ7
min               50.0     2.80    2.44   1.818       8      10    6.22
R1                   0      218     200       0       0     100     150   ≤ 100
R2                   0        0      36     800     200       0      18   ≤ 100
Simplex              1        1       1       1       1       1       1   = 1
Optimal cost         -        -   26.22   21.28    7.35    5.95    5.95
Mixture weights      0    0.424       0       0   0.500   0.076       0
λ^c_1           1.0e15    0.024   0.010   0.238   0.227   0.217   0.061
λ^c_2           1.0e15    0.025   0.015       0   0.060   0.210   0.041
Table 2.2: Column Generation example with 100 objects. The tableau is
displayed in its final form after convergence. λ^c_1 and λ^c_2 give the Lagrange-multiplier
trajectories up until convergence. R1 and R2 are resource constraints. γ1 is a ‘do-nothing’
strategy. Bold numbers represent useful solution data.
Figure 2·9: The 3 pure strategies that correspond to columns 2, 5 and 6 of
Table 2.2. The frequency of choosing each of these 3 strategies is controlled by
the relative proportion of the mixture weight qc ∈ (0..1) with c ∈ {2, 5, 6}.
Note that the strategy of column 5
uses only the second sensor, whereas the strategies of columns 2 and 6 use only the
first sensor. The mixed strategy allows the soft resource constraints to be satisfied with
equality. Table 2.2 also shows the resource costs and expected classification performance
of each column.
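As a check on the tableau, the expected cost of the optimal mixture is 0.424 × 2.80 + 0.500 × 8 + 0.076 × 10 ≈ 5.95 units, matching J^*, while the expected resource use is 0.424 × 218 + 0.076 × 100 ≈ 100 units for sensor 1 and 0.500 × 200 = 100 units for sensor 2, which is why the soft constraints are met with equality.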
The example illustrates some of the issues associated with the use of soft constraints
in the optimization: the resulting solution does not lead to SM strategies that will
always satisfy the hard constraints Eq. 2.2. The Column Generation algorithm generates
approximate SM solutions that can be infeasible in two different respects. First, when
improbable observations occur, the realized resource usage can deviate significantly from
the planned (expected) usage; surprising observations can create unexpected surpluses or
shortages in the sensor resource budgets, and the effect of these variations needs to be
mitigated.
Generation solution is approximate is w.r.t. the mixture weights used to randomize
between the various strategies in the solution set. It is possible that the distribution
(probability mass function) that determines the relative frequency of utilization of each
pure strategy will be degenerate; this tends to happen towards the end of a simulation
when resources are running low. If the governing distribution is degenerate and only
one pure strategy has support, this pure strategy gives the optimal solution to the SM
problem (though still just w.r.t. expected resource usage). If multiple pure strategies
have support, the operative LP relaxation will frequently prescribe courses of action such
as performing an action that costs 10 units 10% of the time and an action that costs
2 units 50% of the time (and doing nothing 40% of the time) when only 2 units of resources
are left. When there are large quantities of resources and many actions yet to be taken,
this type of approximation works out well, but when only a few resources remain, such
sensing plans can prescribe actions that are infeasible.
We address these approximation errors and how to handle the termination issues in the
next chapter.
Chapter 3
Receding Horizon Control with
Approximate, Mixed Strategies
In the previous chapter we discussed the theoretical basis for Receding Horizon
(RH) control based on mixtures of pure strategies that near-optimally solve an approximate
version of the SM problem given by Eq. 2.5 with the constraints of Eq. 2.2 and Eq. 2.4.
As previously alluded to in Section 2.2, there are several difficulties that arise with
making use of these mixed strategies. The development of a RH control algorithm that
can make the most of these approximate, mixed strategies without unduly sacrificing
performance (or seen from the dual perspective, wasting resources) represents one of the
main contributions of this dissertation. This chapter develops a set of RH (aka Model
Predictive Control (MPC) or Open-Loop Feedback Control (OLFC)) algorithms that
use replanning to deal with the approximate nature of these mixed strategy solutions.
We explore three different alternatives for generating pure strategies from a set of mixed
strategies and explore the costs and benefits of each of these methods numerically. In
addition to experimenting with these different methods for using the mixed strategies,
we explore various possible parameter configurations for the planning horizon, resource
levels, FA to MD ratios, sensor homogeneity versus heterogeneity and sensor visibility
conditions (using fractional factorial design of experiments).
3.1 Receding Horizon Control Algorithm
The Column Generation algorithm described in Eq. 2.11 - Eq. 2.14 solves the approx-
imate SM problem Eq. 2.9 with “soft” constraints in terms of mixed strategies that,
on average, satisfy the resource constraints. However, for control purposes, one must
select actual SM actions that satisfy the hard constraints Eq. 2.2 and Eq. 2.4. Another
issue is that the solutions of the decoupled POMDPs provide individual sensor schedules
for each location that must be interleaved into a single coherent sensor schedule on a
common time-line. Furthermore, exact solution of the small, decoupled POMDPs for
each set of prices can be time consuming, making the resulting algorithm unsuitable for
real-time SM. To address this issue, we explore a family of RH algorithms that convert
the mixed strategy solutions discussed in the previous chapter to actions that satisfy the
hard constraints, and limit the computational complexity of the resulting algorithm.
These algorithms for RH control start at stage t with an information state/resource
state pair, consisting of available information about each location i = 1, . . . , N repre-
sented by the conditional probability vector πi(t) and available sensor resources Rs(t),
s = 1, . . . , S. The first step in the RH algorithms is to solve the SM problem of Eq. 2.5
starting at stage t to final stage T subject to the soft constraints Eq. 2.8, using the hi-
erarchical Column Generation/POMDP algorithms to obtain a set of mixed strategies.
We introduce a parameter corresponding to the maximum number of sensing actions per
location to control the resulting computational complexity of the POMDP subproblem
solutions.
The second step is to select sensing actions to implement at the current stage t from
the mixed strategies. These strategies are mixtures of at most S + 1 pure strategies,
with associated probabilistic weights. We explore three approaches for selecting sensing
actions:
• str1: Select the pure strategy with maximum probability.
• str2: Randomly select a pure strategy per location according to the optimal mix-
ture probabilities.
• str3: Select the pure strategy with positive probability that minimizes the expected
sensor resource use over all sensors (and leaves resources for use in future stages).
The column gen simulator that we have developed also supports two other methods
for converting mixed strategies to sensing actions:
• str4: Select the pure strategy that minimizes classification cost.
• str5: Randomly select a single pure strategy for all locations jointly according to
the optimal mixture probabilities.
However these latter two methods for RH control were deemed to be less useful and
therefore were not included in the fractional factorial design of experiments analysis.
The pure strategies that are selected for each location map the current information
sets, Ii(t) for location i, into a deterministic sensing action. ‘str1’ and ‘str3’ choose
the same pure strategy to use across all locations, but ‘str2’ chooses a pure strategy
on a location-by-location basis. Note that there may not be enough sensor resources
to execute the selected actions, particularly in the case where the pure strategy with
maximum probability is selected. To address this, we rank sensing actions by their
expected entropy gain [Kastella, 1996]:
\mathrm{Gain}(u_s(t)) = \frac{H(\pi_i(t)) - E_y[H(\pi_i(t+1))\,|\,y, u_s(t)]}{r_s(u_s(t))}   (3.1)
where Ey[·] denotes the expectation over the observation y. We schedule sensor actions
in order of decreasing expected entropy gain, and perform those actions at stage t for
which enough sensor resources remain. We also use the Entropy Gain algorithm at the
very end of a simulation, when resources are nearly depleted and the higher-cost sensor
modes are no longer feasible; see Appendix B for more information. (When the horizon
is very short, the Entropy Gain algorithm is nearly optimal, so this does not constitute
a significant performance limitation in our design.)
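A minimal C sketch of Eq. 3.1 for ranking candidate actions is given below. The likelihood matrices are illustrative (in the spirit of Table 3.1, restricted to three states), and the resource costs of 0.25 and 1 unit follow the search and low-resolution mode costs used in the simulations of Section 3.2; none of this is the simulator code itself.

#include <math.h>
#include <stdio.h>

#define NSTATES 3
#define NOBS    3

/* Shannon entropy (nats) of a discrete belief vector. */
static double entropy(const double *p, int n)
{
    double h = 0.0;
    for (int j = 0; j < n; ++j)
        if (p[j] > 0.0)
            h -= p[j] * log(p[j]);
    return h;
}

/* Expected entropy gain per unit resource for one candidate action (Eq. 3.1):
 * (H(pi) - E_y[H(pi')]) / r, where pi' is the Bayes posterior for outcome y. */
static double entropy_gain(const double pi[NSTATES],
                           double lik[NSTATES][NOBS], double r)
{
    double expected_post_h = 0.0;
    for (int y = 0; y < NOBS; ++y) {
        double post[NSTATES], py = 0.0;
        for (int j = 0; j < NSTATES; ++j) {
            post[j] = lik[j][y] * pi[j];   /* unnormalized posterior */
            py += post[j];
        }
        if (py <= 0.0)
            continue;
        for (int j = 0; j < NSTATES; ++j)
            post[j] /= py;
        expected_post_h += py * entropy(post, NSTATES);
    }
    return (entropy(pi, NSTATES) - expected_post_h) / r;
}

int main(void)
{
    /* Illustrative likelihoods for a 'search' mode and a low-resolution mode */
    double search[NSTATES][NOBS] = {
        { 0.92, 0.04, 0.04 }, { 0.08, 0.46, 0.46 }, { 0.08, 0.46, 0.46 } };
    double mode1[NSTATES][NOBS]  = {
        { 0.95, 0.03, 0.02 }, { 0.05, 0.85, 0.10 }, { 0.05, 0.10, 0.85 } };
    double pi[NSTATES] = { 0.5, 0.3, 0.2 };

    printf("gain(search) = %.4f per unit\n", entropy_gain(pi, search, 0.25));
    printf("gain(mode1)  = %.4f per unit\n", entropy_gain(pi, mode1, 1.0));
    return 0;
}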
The measurements collected from the scheduled actions are used to update the infor-
mation states πi(t + 1) using Eq. 2.3. The resources used by the actions are eliminated
from the available resources to compute Rs(t + 1) using Eq. 2.4. The RH algorithm
is then executed from the new information state/resource state condition in iterative
fashion until all resources are expended.
3.2 Simulation Results
In order to evaluate the relative performance of the alternative RH algorithms, we
performed a set of simulations. In these experiments, there were 100 locations, each of
which could be empty or contain an object of one of three types, so the possible states
of location i were xi ∈ {0, 1, 2, 3}, where type 1 represents cars, type 2 trucks, and type 3
military vehicles. Sensors can have several modes: a ‘search’ mode, a low resolution ‘mode1’ and
a high resolution ‘mode2’. The search mode primarily detects the presence of objects;
the low resolution mode can identify cars, but confuses the other two types, whereas the
high resolution mode can separate the three types. Observations are modeled as having
three possible values. The search mode consumes 0.25 units of resources, whereas the
low-resolution mode consumes 1 unit and the high resolution mode 5 units, uniformly
for each sensor and location. Table 3.1 shows the likelihood functions that were used in
the simulations.
Initially, each location has a state with one of two prior probability distributions:
πi(0) = [0.10 0.60 0.20 0.10]T
, i ∈ [1, . . . , 10] or πi(0) = [0.80 0.12 0.06 0.02]T
, i ∈
[11, . . . , 100]. Thus, the first 10 locations are likely to contain objects, whereas the other
Search Low-res Hi-res
o1 o2 o3 o1 o2 o3 o1 o2 o3
empty 0.92 0.04 0.04 0.95 0.03 0.02 0.95 0.03 0.02
car 0.08 0.46 0.46 0.05 0.85 0.10 0.02 0.95 0.03
truck 0.08 0.46 0.46 0.05 0.10 0.85 0.02 0.90 0.08
military 0.08 0.46 0.46 0.05 0.10 0.85 0.02 0.03 0.95
Table 3.1: Observation likelihoods for different sensor modes with the ob-
servation symbols o1, o2 and o3. Low-res = ‘mode1’ and High-res = ‘mode2’.
90 locations are likely to be empty. When multiple sensors are present, they may share
some locations in common, and have locations that can only be seen by a specific sensor,
as illustrated in Fig. 3·1.
In general we consider the resource levels: 300 units (resource poor scenario), 500 units
(normal resources) and 700 units (resource rich scenario), where these numbers are the
cumulative amounts for all sensors involved in a simulation. However, in situations in-
volving search and classification, the available resources per sensor were scaled back.
If, for example, a search and classify simulation had 2 sensors involved, then resources
were reduced by a factor of 2 × 5 = 10: there are 2 sensors, and only about 20–25% of
the locations are occupied, so the unoccupied locations can be ruled out very cheaply. If
there were 80 empty locations, it would be possible to search them all twice for just
40 resource units (at 0.25 units per search). About 85% of the likely-to-be-empty locations
could be safely ruled out with 2 looks, and all of the more expensive measurements get
focused on more ambiguous locations with the remaining 20–100 units (depending on
the setup) of resources between both sensors. (Keeping resources calibrated this way
makes comparisons possible across simulation parameter sets).
The cost function used in the experiments, c(xi, vi) is shown in Table 3.2. The pa-
rameter MD represents the cost of a missed detection, and is varied in the experiments.
The variable “Horizon” is the total number of sensor actions allowed per location plus
one additional action for estimating the location content (i.e. making a classification
decision).
Figure 3·1: Illustration of scenario with two partially-overlapping sensors.
xi; vi empty car truck military
empty 0 1 1 1
car 1 0 0 1
truck 1 0 0 1
military MD MD MD 0
Table 3.2: Decision costs
Table 3.3 shows simulation results for a search and classify scenario involving 2 iden-
tical sensors (with the same visibility profile), evaluating the 3 alternative versions of
the RH control algorithms with 3 resource levels: 30, 50 and 70 units. Table 3.4 dis-
plays the accompanying lower bound performance computed for these simulations. The
missed detection cost MD is varied from 1 to 5 to 10. (The MD cost affects the expected
classification cost as discussed in Fig. 2·4). The results shown in Table 3.3 represent the
average of 100 Monte Carlo simulation runs of the 3 RH algorithms. Graphical versions
of this table are provided in Fig. 3·2 - Fig. 3·4.
For a horizon 6 plan, the longest horizon studied, the simulation performance is
close to that of the associated bound. The results show that the different methods of
RH control have performance close to the optimal lower bound in most cases, with the
exception being the case of MD = 5 with 70 units of sensing resources (per sensor).
For shorter horizons, the simulation performance is sometimes better than the “bound”
because, after multiple planning iterations, the MPC algorithm allows more sensor
observations per object than the plan accounts for. (A simulation may
re-plan 5+ times before resources are exhausted, but the bound was computed assuming
there will only be a maximum of two sensing opportunities per object for a horizon=3
problem). Obviously, the easily classified objects are ruled out of the competition for
sensor resources (their identities are decided upon) early in the simulation process,
and with every additional planning iteration, sensor resources are concentrated on the
remaining objects whose identities are uncertain. In this situation the approximate
sensor plan from the Column Generation algorithm does not match well with the way
events unfold, and the “bound” is not a bound. However, after choosing a horizon that
accounts appropriately for how many RH planning iterations there will be and how
many sensing opportunities will take place at each location, the bounds are tight.
In terms of which strategy is preferable for converting the mixed strategies to a pure
strategy, the results of Table 3.3 are inconclusive. For short planning horizons in the RH
algorithms, the preferred strategy appears to be the one that uses the least resources (str3):
because planning with a longer horizon improves performance only marginally, a RH
replanning approach with a short horizon combined with a resource-conservative planning
strategy reduces computation time with limited performance degradation. For the longer
horizons, there was no significant difference in performance
among the three strategies we investigated.
In the next set of experiments, we compare the use of heterogeneous sensors that
have different modes available. In these experiments, the 100 locations are guaranteed
to have an object, so xi = 0 is not feasible. The prior probability of object type for
each location is πi(0) = [0 0.7 0.2 0.1]T
. Table 3.5 shows the results of experiments with
sensors that have all sensing modes, versus an experiment where one sensor has only
a low-resolution mode and the other sensor has both high- and low-resolution modes.
The table shows the lower bounds predicted by the Column Generation algorithm, to
illustrate the change in performance expected from the different architectural choices
of sensors.
MD = 1 MD = 5 MD = 10
Hor. 3 str1 str2 str3 str1 str2 str3 str1 str2 str3
Res 30 3.64 3.85 3.85 11.82 12.88 12.23 15.28 14.57 14.50
Res 50 2.40 2.80 2.43 6.97 6.93 7.84 10.98 9.99 10.45
Res 70 2.45 2.32 1.88 3.44 3.99 4.04 6.14 6.48 5.10
Hor. 4
Res 30 3.58 3.46 3.52 12.28 12.62 11.90 14.48 15.91 15.59
Res 50 2.37 2.21 2.33 7.44 7.44 7.20 9.94 9.28 10.65
Res 70 1.68 1.33 1.60 3.59 3.57 3.62 6.30 5.18 5.86
Hor. 6
Res 30 3.51 3.44 3.73 11.17 11.85 12.09 15.17 14.99 13.6
Res 50 2.28 2.11 2.31 7.29 8.02 7.70 10.67 10.47 11.25
Res 70 1.43 1.38 1.44 3.60 3.73 3.84 4.91 5.09 5.94
Table 3.3: Simulation results for 2 homogeneous, multi-modal sensors in
a search and classify scenario. str1: select the most likely pure strategy for
all locations; str2: randomize the choice of strategy per location according to
mixture probabilities; str3: select the strategy that yields the least expected
use of resources for all locations. See Fig. 3·2 - Fig. 3·4 for the graphical version
of this table.
MD
Horizon 3 1 5 10
Res[30, 30] 4.96 12.11 14.64
Res[50, 50] 4.09 8.79 11.16
Res[70, 70] 3.38 6.20 8.19
Horizon 4
Res[30, 30] 4.24 11.86 14.56
Res[50, 50] 3.09 6.72 9.50
Res[70, 70] 2.16 4.24 5.94
Horizon 6
Res[30, 30] 3.35 11.50 13.85
Res[50, 50] 2.21 6.27 9.40
Res[70, 70] 1.32 2.95 4.96
Table 3.4: Bounds for the simulations results in Table 3.3. When the horizon
is short, the 3 MPC algorithms execute more observations per object than were
used to compute the “bound”, and therefore, in this case, the bounds do not
match the simulations; otherwise, the bounds are good.
Figure 3·2: This figure is the graphical version of Table 3.3 for horizon 3.
Simulation results for two sensors with full visibility and detection (X=’empty’,
’car’, ’truck’, ’military’) using πi(0) = [0.1 0.6 0.2 0.1]T
∀ i ∈ [0..9], πi(0) =
[0.80 0.12 0.06 0.02]T
∀ i ∈ [10..99]. There is one bar in each sub-graph for
each of the three simulation modes studied in this chapter. The theoretical
lower bound can be seen in the upper-right corner of each bar-chart.
Figure 3·3: This figure is the graphical version of Table 3.3 for horizon 4.
Figure 3·4: This figure is the graphical version of Table 3.3 for horizon 6.
The results indicate that specialization of one sensor can lead to significant
degradation in performance due to inefficient use of its resources.
The next set of results explore the effect of spatial distribution of sensors. We con-
sider experiments where there are two homogeneous sensors which have only partially-
overlapping coverage zones. (We define a “visibility group” as a set of sensors that have
a common coverage zone). Table 3.6 gives bounds for different percentages of overlap.
Note that, even when there is only 20% overlap, the achievable performance is similar to
that of the 100% overlap case in Table 3.5, indicating that proper choice of strategies can
lead to efficient sharing of resources from different sensors by equalizing their workload.
The last set of simulation results we consider show the performance of these RH
algorithms for three homogeneous sensors with partial sensor overlap, no detection and
varying resource levels. The visibility groups are graphically portrayed in Fig. 3·5.
Table 3.7 presents the simulated cost values averaged over 100 simulations of the three
different RH algorithms. See Table 3.8 for the accompanying lower bounds. Fig. 3·6 -
Fig. 3·8 display the graphical version of these tables. The results support our previous
Homogeneous Heterogeneous
MD MD
Horizon 3 1 5 10 1 5 10
Res[150, 150] 5.689 16.928 30.380 6.338 18.150 31.233
Res[250, 250] 4.614 16.114 25.917 5.527 16.767 29.322
Res[350, 350] 4.225 15.301 21.453 5.123 16.414 27.411
Horizon 4
Res[150, 150] 5.016 16.059 20.606 5.641 16.849 20.606
Res[250, 250] 3.939 9.461 12.662 4.576 12.047 14.873
Res[350, 350] 3.352 8.578 12.474 4.275 9.407 12.651
Horizon 6
Res[150, 150] 4.618 15.661 19.564 5.271 16.202 19.564
Res[250, 250] 2.919 8.237 10.913 3.321 8.830 11.347
Res[350, 350] 2.175 4.860 7.151 2.658 6.629 9.174
Table 3.5: Comparison of lower bounds for 2 homogeneous, bi-modal sensors
(left 3 columns) versus 2 heterogeneous sensors in which S1 has only ‘mode1’
available but S2 supports both ‘mode1’ and ‘mode2’ (right 3 columns). There
is 1 visibility-group with πi(0) = [0.7 0.2 0.1]T
∀ i ∈ [0..99]. For many of the
cases studied there is a performance hit of 10–20%.
Overlap 60% Overlap 20%
MD MD
Horizon 3 1 5 10 1 5 10
Res[150, 150] 5.69 16.93 30.38 5.69 16.93 30.38
Res[250, 250] 4.61 16.11 25.98 4.61 16.11 25.92
Res[350, 350] 4.23 15.30 21.45 4.23 15.30 21.45
Horizon 4
Res[150, 150] 5.02 16.06 20.61 5.02 15.93 20.61
Res[250, 250] 3.94 9.46 12.66 3.94 9.46 12.66
Res[350, 350] 3.35 8.58 12.47 3.35 8.58 12.47
Horizon 6
Res[150, 150] 4.62 15.66 19.56 4.62 15.66 19.56
Res[250, 250] 2.92 8.25 10.91 2.94 8.24 10.91
Res[350, 350] 2.18 4.86 7.19 2.18 4.86 7.16
Table 3.6: Comparison of sensor overlap bounds with 2 homogeneous, bi-modal
sensors and 3 visibility-groups. Both configurations use the prior πi(0) =
[0.7 0.2 0.1]^T. Compare and contrast with the left half of Table 3.5: most of
the time the two sensors have enough objects in view to use their resources
efficiently in both the 60% and 20% overlap configurations; only the bold
numbers differ.
Figure 3·5: The 7 visibility groups for the 3 sensor experiment indicating the
number of locations in each group.
conclusions: when a short horizon is used in one of the RH algorithms, and there are
sufficient resources, the pure strategy that uses the least resources is preferred as it
allows for replanning when new information is available. If the RH algorithm uses a
longer horizon, then its performance approaches the theoretical lower bound, and the
difference in performance between the three approaches for sampling the mixed strategy
to obtain a pure strategy is statistically insignificant.
To illustrate the computational requirements of this scenario (4 states, 3 observations,
2 sensors (6 actions), full sensor-overlap), the number of columns generated by the
Column Generation algorithm to compute a set of mixed strategies was on the order of
10–20 columns for the horizon 5 algorithms, which takes about 60 sec on a 2.2 GHz,
single-core, Intel P4 machine under Linux using C code in “Debug” mode (with 1000
belief-points for PBVI). Memory usage without optimizations is around 3 MB. There are
typically 4–5 planning sessions in a simulation before resources are exhausted. Profiling
indicates that roughly 80% of the computing time goes towards Value Backups in the
PBVI routine and 15% goes towards tracing decision-trees in order to back out (deduce)
the measurement costs from hyperplane costs (see Section 2.2.3). A set of simulations
                  MD = 1                  MD = 5                  MD = 10
           str1   str2   str3      str1   str2   str3      str1   str2   str3
Hor. 3
  Res 100  5.26   6.08   5.57      17.23  17.44  16.79     22.02  21.93  22.16
  Res 166  5.91   4.81   3.13      10.23  11.91   9.21     14.19  16.66  12.85
  Res 233  3.30   3.75   3.43      10.15   9.32   5.88     14.49  12.55   8.21
Hor. 4
  Res 100  5.32   5.58   5.93      17.26  16.88  16.17     21.92  20.94  21.35
  Res 166  3.42   4.07   3.24       8.63   8.00   9.04     12.05  11.71  14.08
  Res 233  3.65   3.07   3.29       5.27   7.14   5.38      8.25  10.08   7.90
Hor. 6
  Res 100  5.79   5.51   5.98      17.13  17.90  17.44     22.03  20.56  22.17
  Res 166  2.96   2.68   2.72      10.22   8.33   9.08      9.82  11.47  11.57
  Res 233  1.52   2.00   1.70       4.81   4.13   4.24      5.64   7.20   5.11

Table 3.7: Simulation results for 3 homogeneous sensors without using detection but with partial overlap as shown in Fig. 3·5. See Fig. 3·6 - Fig. 3·8 for the graphical version.
                 MD = 1    MD = 5    MD = 10
Horizon 3
  Res[30, 30]    5.69      16.93     30.38
  Res[50, 50]    4.61      16.11     25.89
  Res[70, 70]    4.26      15.31     21.48
Horizon 4
  Res[30, 30]    5.02      15.92     20.61
  Res[50, 50]    3.94       9.46     12.66
  Res[70, 70]    3.35       8.58     12.48
Horizon 6
  Res[30, 30]    4.62      15.66     19.56
  Res[50, 50]    2.92       8.22     10.89
  Res[70, 70]    2.18       4.87      7.18

Table 3.8: Bounds for the simulation results in Table 3.7. When the horizon is short, the 3 MPC algorithms execute more observations per object than were used to compute the bound, and therefore, in this case, the bounds do not match the simulations; otherwise, the bounds are good.
Figure 3·6: This figure is the graphical version of Table 3.7 for horizon 3.
Situation with no detection but limited visibility (X=’car’, ’truck’, ’military’)
using πi(0) = [0.70 0.20 0.10]^T ∀ i ∈ [0..99]. There were 7 visibility-groups:
20x001, 20x010, 20x100, 12x011, 12x101, 12x110, 4x111. The 3 bars in each
sub-graph are for ‘str1’, ‘str2’, ‘str3’ respectively. The theoretical lower bound
can be seen in the upper-right corner of each bar-chart.
Figure 3·7: This figure is the graphical version of Table 3.7 for horizon 4.
Figure 3·8: This figure is the graphical version of Table 3.7 for horizon 6.
with 81 parameter combinations, 100 Monte Carlo runs for each combination, 4 states,
3 observations, 2 sensors (6 actions) and 7 visibility groups as in Fig. 3·5 required 5 days
to compute and entailed solving on the order of 2 million POMDPs. (As a reality check,
the number of seconds in 5 days is 432,000, and 432,000/2.0e6 = 0.216 sec per POMDP,
which makes sense).
The problem with handling sensor visibility in this way is that we are again in the
land of combinatorics, which does not scale well for large numbers of sensors. If there are
3 sensors, then in general there are 2^3 − 1 = 7 possible combinations for how sensors can cover an area (every combination except the trivial “no sensor has visibility” combination is
considered), and POMDP subproblems must be solved for each combination. However,
at least in real-world scenarios, sensors are not likely to all be in one geographic location,
and there will not be the need to solve POMDP subproblems according to every possible
combination of sensors with visibility in that region.
In terms of the computational complexity of our RH algorithms, the main bottleneck
is obviously the solution of the POMDP problems. The LPs solved in the column
generation approach are small and are solved in minimal time. Solving the POMDPs
required to generate each column (one POMDP for each visibility group in cases with
partial sensor overlap) is tractable by virtue of the hierarchical breakdown of the SM
problem into independent subproblems. It is also possible to parallelize the POMDP
computations and even the columns generated in Column Generation using multi-core
CPU or GPU processors.
Our results suggest that RH control with modest horizons of 2 or 3 sensor actions
per location can yield performance close to the best achievable performance using mixed
strategies that are resource-conservative. This result is all the more significant because the number of hyperplanes that support the optimal value function grows super-exponentially in the planning horizon (PBVI does not necessarily capture all of these hyperplanes), so a shorter horizon yields a huge computational savings. If shorter horizons are
used to reduce computation, then an approach that samples mixed strategies by using
the smallest amount of resources (while still using resources) is preferred. These results
also show that, with proper SM, geographically distributed sensors with limited visibility
can be coordinated to achieve equivalent performance to centrally pooled resources.
Chapter 4
Adaptive SM with State Dynamics
This chapter considers several extensions to the basic SM problem in Section 2.2. We
have thus far primarily concerned ourselves with problems involving stationary states
at the locations being investigated and sensor platforms that can observe locations in
any order they choose. The stationary state assumption is a significant limitation, and
so is the assumption that locations are always visible to automata. To develop a more
realistic algorithm, we consider two alternative extensions that generalize the baseline
model of Section 2.2 to handle either per location state dynamics or known per location
visibility dynamics. While we do not attempt to include all of these additional model
features in one unified algorithm, this chapter provides the basis for subsequent research
in this domain.
Much of the formulation in this chapter is similar but not identical to Section 2.2.
Slight differences exist concerning indexing and variable definitions (e.g. xi versus xi(t),
t ≤ T − 1 versus t ≤ T etc. . . ), so we reproduce relevant parts of the formulation here
instead of referring to Section 2.2.
4.1 Time-varying States Per Location
The first extension we consider is a relaxation of the assumption that each location has a
static state. As previewed with Fig. 2·7, Fig. 2·8 and Fig. 2·6, in this section we assume
that the state at each location has Markov dynamics. We use a Markov Birth-Death
Process, generalized to handle a discrete set of possible “live” states per location, to
model the potential for new objects to show up unexpectedly or for previously known
objects to disappear without warning. Since the per location states are unobservable,
each of the N locations will have an associated Hidden Markov Model (HMM) that
describes the state dynamics for that location. Our decomposition approach requires
that a common set of actions (sensor modes that consume resources) be used across lo-
cations, but otherwise this formulation allows the HMM (states, transition probabilities,
observation probabilities) to vary across locations. Every distinct HMM will necessitate
its own POMDP subproblem solution, so the flexibility comes with a loss of solution
generality.
Assume there are a finite number of locations 1, . . . , N, each of which may at a
particular time have an object of a given type or may be empty. Let there be S sensors,
each of which has multiple sensor modes indexed as m = 1, . . . , Ms, and assume that
each sensor can observe a set of locations at each discrete time-instant (stage) with a
mode selected per location.
Let xi(t) ∈ {0, 1, . . . , D} denote the state of location i at time t, where xi(t) = 0 if
location i is unoccupied, and otherwise xi(t) = k > 0 indicates location i contains an
object of type k at time t. Let πi(0) ∈ ℜ^{D+1} be a discrete a priori probability distribution over the possible states for the i-th location, for i = 1, . . . , N, where D ≥ 2. Assume
additionally that the random variables xi(t) for i = 1, . . . , N are mutually independent
for each time t. Let the state of each of the N locations be governed by an independent
Markov chain such as the example in Fig. 4·1. In our model the transition probabilities
are specified as a stochastic matrix {pjk} of dimension (D + 1) × (D + 1) that has as
elements the (stationary) probabilities pjk = P(xi(t + 1) = j|xi(t) = k). We use these
transition probabilities to give locations an arrival probability p_{a_i} of transitioning from an ‘empty’ state to a non-empty state and a departure probability p_{d_i} of transitioning from a non-empty state to an ‘empty’ state. These probabilities may depend on the
Figure 4·1: An example HMM that can be used for each of the N loca-
tions. pa is an arrival probability and pd is a departure probability for the
Markov chain.
initial state for departures or final state for arrivals.
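As a concrete illustration, the following minimal sketch (in Python; the values of pa, pd, D and the uniform split of arrivals over object types are hypothetical choices, not parameters taken from our experiments) assembles a transition matrix of the form shown in Fig. 4·1:

import numpy as np

def birth_death_matrix(p_a, p_d, D, arrival_dist=None):
    # Column-stochastic matrix P with P[j, k] = P(x_i(t+1) = j | x_i(t) = k).
    # State 0 is 'empty'; states 1..D are object types. An arrival moves the chain
    # from state 0 to a non-empty state, a departure moves any non-empty state to 0.
    if arrival_dist is None:
        arrival_dist = np.full(D, 1.0 / D)      # split arrivals evenly over object types
    P = np.zeros((D + 1, D + 1))
    P[0, 0] = 1.0 - p_a                         # stay empty
    P[1:, 0] = p_a * arrival_dist               # arrival of some object type
    for k in range(1, D + 1):
        P[0, k] = p_d                           # departure back to 'empty'
        P[k, k] = 1.0 - p_d                     # object of type k persists
    return P

P = birth_death_matrix(p_a=0.05, p_d=0.02, D=2)
assert np.allclose(P.sum(axis=0), 1.0)          # every column sums to one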
There are s = 1, . . . , S sensors, each of which has m = 1, . . . , Ms possible modes of
observation. Let there be a series of T discrete decision stages with t = 1, . . . , T for
sensors to make measurements. Each sensor s has a limited set of locations that it can
observe at each stage, denoted by Os(t) ⊆ {1, . . . , N}. At each stage, each sensor can
choose to employ one of its sensor modes to collect noisy measurements concerning the
states xi(t) of the sensed locations in its Field of View (FOV) (location i is in the FOV
of sensor s if i ∈ Os(t)).
To define the cost, we assume that a tentative decision is made concerning the
identity (state) of each location at the end of each stage. We assume the following
model of causality: at time t, a subset of the N locations are sensed using various
sensor modes, statistics (beliefs) concerning the states of these locations are updated,
and (still at time t) a classification decision (aka “declaration”) is made about the state
of each location. The process repeats for each successive time-step until T stages of
time have elapsed. The action space for each stage is then the Cartesian product of
the set of feasible sensor-mode assignments for the N locations with the set of tentative
classification decisions for the N locations.
A sensor action by sensor s at stage t is the set of pairs:
us(t) = {(is(t), ms(t)) | is(t) ∈ Os(t), ms(t) ∈ Ms} (4.1)
where each pair consists of a location to observe is(t), and a sensor mode (independent
for each location) used to observe this location, ms(t), where the mode is restricted to
the set of feasible modes given the resource levels for each sensor. We assume that no two
sensors observe the same location at the same time in order to minimize the complexity
of the associated action and observation spaces. Let ui,s(t) refer to the sensor action
taken on location i with sensor s at stage t if any, or let ui,s(t) = ∅ otherwise.
Sensor measurements are modeled as belonging to a finite set ys,m ∈ {1, . . . , Ls}. The
likelihood of the measured value is assumed to depend on the sensor s, sensor mode m,
location i and on the true state at the location, xi(t), but not on the states of other loca-
tions (statistical independence). Denote this likelihood as P(ys,m(t)|xi(t), i, s, m). Thus,
the Markov models are observed through noisy measurements, resulting in HMMs. We
assume that this likelihood given xi(t) is time-invariant, and that the random measure-
ments ys,m(t) are conditionally independent of other measurements yσ,n(τ) given the
states xi(t), xj(τ) for all sensors s, σ and modes m, n, provided i ≠ j or τ ≠ t.
Assume each sensor has a quantity Rs of resources available for measurements during
each stage, so there is a periodic constraint on sensor utilization. Associated with the
use of mode m by sensor s on location i at time t is a resource cost rs(ui,s(t)) to use this
mode, representing power or some other type of resource required to operate the sensor:
\sum_{i \in O_s(t)} r_s(u_{i,s}(t)) \;\le\; R_s \qquad \forall\, s \in [1 \ldots S];\ \forall\, t \in [1 \ldots T]   (4.2)
This is a hard constraint for each realization of observations and decisions.
Let I(t) denote the sequence of past sensing actions and measurement outcomes up
to and including stage t − 1:
I(t) = {(ui,s(τ), ys,m(τ))| i ∈ Os(τ); s = 1, . . . , S; τ = 1, . . . , t − 1}
Define I(0) as the prior probability π(0) = p(x(0)) = ∏_{i=1}^{N} p(xi(0)). Under the assump-
tion of conditional independence of measurements and independence of the Markov
chains governing each location, the joint probability π(t) = P(x1(t) = k1, x2(t) =
k2, . . . , xN (t) = kN |I(t)) can be factored as the product of belief-states (marginal con-
ditional probabilities) for each location. Denote the belief-state at location i as πi(t) =
p(xi(t)|I(t)). The belief-state πi(t) is a sufficient statistic to capture all information
that is known about location i at time t. When a sensor measurement is taken, the
belief-state is updated according to Bayes’ Rule. A measurement of location i with the
sensor-mode combination ui,s(t) = (i, m) at stage t that generates observable ys,m(t)
updates the belief-vector as:
\pi_i(t+1) = \frac{\mathrm{diag}\{P(y_{s,m}(t) \mid x_i(t) = j, i, s, m)\}\, \pi_i(t)}{\mathbf{1}^T \mathrm{diag}\{P(y_{s,m}(t) \mid x_i(t) = j, i, s, m)\}\, \pi_i(t)}   (4.3)
where 1 is the D + 1 dimensional vector of all ones. Eq. 4.3 captures the relevant
information dynamics that SM controls with our HMM state formulation.
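As a minimal numerical sketch of the update in Eq. 4.3 (the prior and likelihood values below are hypothetical; in the HMM setting a prediction step through the transition probabilities {p_jk} is also applied between measurement stages):

import numpy as np

def bayes_update(pi, likelihood):
    # Eq. 4.3: pi is the belief over states 0..D and likelihood[j] = P(y | x_i(t) = j)
    unnorm = likelihood * pi                    # diag{P(y | x = j)} pi
    return unnorm / unnorm.sum()                # normalize by 1^T diag{...} pi

pi0 = np.array([0.7, 0.2, 0.1])                 # prior over {empty, type 1, type 2}
lik = np.array([0.6, 0.3, 0.1])                 # hypothetical P(y | x = j) for the observed y
print(bayes_update(pi0, lik))                   # posterior belief pi_i(t+1)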
Given the information I(t) at stage t, the quality of the information collected is
measured by making an estimate of the state xi(t) of each location i given the available
information (the information history of observations and actions and the initial prob-
ability vector π(t)). Denote these estimates as vi(t) ∀ i = 1, . . . , N. The Bayes’ cost
of selecting estimate vi(t) when the true state is xi(t) is denoted as c(xi(t), vi(t)) ∈ ℜ
with c(xi(t), vi(t)) ≥ 0. Typically we assume c(xi(t), vi(t)) to be a 0–1 symmetric cost
matrix, or else a matrix with 0 cost along the diagonal and FA and MD cost terms off
the diagonal in the appropriate locations (relative to which state is not to be missed
and which state is a nuisance-detection).
The objective of this problem is to estimate the state of each location at each time
with minimum error as measured by the number of FAs and MDs:
J = \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} c(x_i(t), v_i(t)) \right]   (4.4)
subject to Eq. 4.2. The minimization is done over the (countable) space of admissible,
adaptive feedback strategies γ ∈ Γ. In this context, a strategy γ is a time-varying
mapping from an information set (history) to a sensor action and tentative classification
decision.
After replacing the hard resource constraint in Eq. 4.2 with an expected-resource-use
constraint for each of the S sensors, we have the constraints:
\sum_{i \in O_s(t)} E[r_s(u_{i,s}(t))] \;\le\; R_s \qquad \forall\, s \in [1 \ldots S];\ \forall\, t \in [1 \ldots T]   (4.5)
We can dualize these resource constraints and create an augmented objective-function
(Lagrangian) of the form:
J_\lambda = \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} c(x_i(t), v_i(t)) \;-\; \sum_{s=1}^{S} \sum_{t=1}^{T} \lambda_s(t) \left( R_s - \sum_{i \in O_s(t)} r_s(u_{i,s}(t)) \right) \right]   (4.6)
This problem is a lower bound on the original problem Eq. 4.4 with sample path con-
straints Eq. 4.2 because every strategy that satisfies the original sample path constrained
problem is feasible for the relaxed problem.
In order to proceed with this derivation, we need a theoretical justification for why
choosing actions for locations on an individual basis will not detrimentally affect the
optimal cost on a global basis now that locations have time-varying state. The work of
[Casta˜n´on, 2005a] provides such a result for a stationary-state case; we need to generalize
this theory for state dynamics. The idea that classification performance should not
suffer seems rather intuitive considering that the states of each location are statistically
independent; however, there is still the coupling mechanism governed by resource usage.
Consider the following lemma (which uses some terms, e.g. “local strategies”, defined in
Section 2.2.1):
Lemma 4.1.1 (Optimality of Local Adaptive Feedback Strategies). Given a SM prob-
lem with periodic resource constraints, an independent HMM governing the state of each
location i ∀ i ∈ [1, . . . , N], and the Lagrange multiplier trajectories λs(t) ∀ s, t, the perfor-
mance of an optimal, non-local, adaptive feedback strategy γ is equal to the performance
of N, locally optimal, adaptive feedback strategies γi.
Proof.
We have the following inequality:
\min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t)) \right) \right] \;\ge\; \sum_{i=1}^{N} \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t)) \right) \right]   (4.7)
because on the right-hand side the minimum for each term in the sum can use a different
strategy, whereas on the left hand side the same strategy must be used for all N terms.
Now, consider the minimization problem for each location i:
\min\; E\left[ \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t)) \right) \right]
We can solve this problem via SDP. We break the decision problem at each stage t into
two stages: first, we select ui,s(t) and collect information on object i. Then, we select
vi(t), the tentative classification. At the final stage, consider the selection of vi(T) as a
function of the complete information state I(T), collected over the entire set of locations,
as:

J_i^*(I(T), T) = \min_{v_i(T)} E\left[ c(x_i(T), v_i(T)) \mid I(T) \right] = \min_{v_i(T)} E\left[ c(x_i(T), v_i(T)) \mid I_i(T) \right] \equiv J_i^*(I_i(T), T)
because of the independence of xi(T) from other xj(T) and conditional independence
of the observations of location i from those of other locations. These independence
assumptions imply p(xi(T)|I(T)) = p(xi(T)|Ii(T)). Thus, the optimal decision, vi(T),
and the optimal cost-to-go will be a function only of Ii(T) and not all of I(T).
Now, assume inductively that for stages τ > t, the optimal cost-to-go J_i^*(I(τ), τ) ≡ J_i^*(I_i(τ), τ) depends only on the information collected at location i, and the strategy for
the optimal decision vi(τ) and measurements ui,s(τ +1) for s = 1, . . . , S depends only on
Ii(τ) and not all of I(τ). Consider the minimization over the choice of vi(t), ui,s(t + 1),
s = 1, . . . , S. Under γ, these are functions of the entire information state I(t). Bellman’s
equation becomes:
J_i^*(I(t), t) = \min_{v_i(t),\, u_{i,s}(t+1)} E\left[ c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t+1)\, r_s(u_{i,s}(t+1)) + E\left[ J_i^*(I_i(t+1), t+1) \mid I(t), \{u_{i,s}(t+1)\} \right] \;\Big|\; I(t) \right]

           = \min_{v_i(t),\, u_{i,s}(t+1)} E\left[ c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t+1)\, r_s(u_{i,s}(t+1)) + E\left[ J_i^*(I_i(t+1), t+1) \mid I_i(t), \{u_{i,s}(t+1)\} \right] \;\Big|\; I_i(t) \right]
because of the same independence assumptions, which implies that p(xi(t)|I(t)) =
p(xi(t)|Ii(t)) and:
E\left[ J_i^*(I_i(t+1), t+1) \mid I(t), \{u_{i,s}(t+1)\} \right] = E\left[ J_i^*(I_i(t+1), t+1) \mid I_i(t), \{u_{i,s}(t+1)\} \right]
Hence, by induction through DP, we have shown:
\min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t)) \right) \right] = \min_{\gamma_i \in \Gamma_L} E^{\gamma_i}\left[ \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t)) \right) \right]
Thus:
\min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t)) \right) \right] \;\ge\; \sum_{i=1}^{N} \min_{\gamma_i \in \Gamma_L} E^{\gamma_i}\left[ \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t)) \right) \right]   (4.8)
where ΓL is the set of admissible, local, adaptive feedback strategies, and γi ∈ ΓL maps
Ii(t) to (vi(t), {ui,s(t + 1)}).
To complete the proof, note that feedback strategies of the form γ = (γ1, γ2, . . . , γN )
are admissible strategies for the optimization problem on the left. Hence, the optimal
local strategies γi achieve equality in the above equation, establishing the lemma.
The cost function can be decoupled into the sum of N local cost functions as follows:
J_\lambda = \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \sum_{t=1}^{T} \sum_{i \in O_s(t)} \lambda_s(t)\, r_s(u_{i,s}(t)) - \sum_{s=1}^{S} \sum_{t=1}^{T} \lambda_s(t) R_s \right]

J_\lambda = \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t))\, I(i \in O_s(t)) \right) \right] - \sum_{t=1}^{T} \sum_{s=1}^{S} \lambda_s(t) R_s
where I(·) is an indicator function (meaning is context-sensitive). Now using Lemma 4.1.1
we have:
J_\lambda = \sum_{i=1}^{N} \min_{\gamma_i \in \Gamma_L} E^{\gamma_i}\left[ \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t))\, I(i \in O_s(t)) \right) \right] - \sum_{t=1}^{T} \sum_{s=1}^{S} \lambda_s(t) R_s   (4.9)
To develop a new lower bound for this formulation with HMM subproblems, we write
Eq. 4.9 in terms of mixed strategies as we did in Section 2.2.1:
J_\lambda = \sum_{i=1}^{N} \min_{\gamma_i \in Q(\Gamma_L)} E^{\gamma_i}\left[ \sum_{t=1}^{T} \left( c(x_i(t), v_i(t)) + \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t))\, I(i \in O_s(t)) \right) \right] - \sum_{t=1}^{T} \sum_{s=1}^{S} \lambda_s(t) R_s   (4.10)
where Q(ΓL) is the set of all mixtures of the pure strategies in the set ΓL. On account
of the fact that we have a relaxed form of resource constraint in Eq. 4.2, we know that
the actual optimal cost in Eq. 4.4 must be lower bounded by Eq. 4.10 because we have
expanded the space of feasible actions. This identification leads to the inequality:
J^* \;\ge\; \sup_{\lambda_1, \ldots, \lambda_S \ge 0} J_{\lambda_1, \ldots, \lambda_S}   (4.11)
from weak-duality in Linear Programming. Eq. 4.11 is the dual of the LP:
\min_{q \in Q(\Gamma_L)} \; \sum_{\gamma_i \in \Gamma_L} q(\gamma_i)\, E^{\gamma_i}\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} c(x_i(t), v_i(t)) \right]   (4.12)

\sum_{\gamma_i \in \Gamma_L} q(\gamma_i)\, E^{\gamma_i}\left[ \sum_{i \in O_s(t)} r_s(u_{i,s}(t)) \right] \;\le\; R_s \qquad \forall\, s \in [1, \ldots, S] \text{ and } t \in [1, \ldots, T]   (4.13)

\sum_{\gamma_i \in \Gamma_L} q(\gamma_i) = 1   (4.14)
where we have one constraint for each of the S sensor resource pools for each time t and
an additional simplex constraint in Eq. 4.14 which ensures that q ∈ Q(ΓL) forms a valid
probability distribution.
With this decomposition and our new lower bound in place, we are able to formulate
individual POMDPs for each of the subproblems according to location-specific dynam-
ics we want to model, provided that we do not violate our statistical independence
assumptions. We can use the same Column Generation techniques that were presented
in Section 2.2, but this time with an expanded set of constraints: we need to use one
constraint per-sensor-per-time. The extra constraints imply that the optimal basis de-
termined via Lagrangian Relaxation will now have ST + 1 variables, though different
variables (pure strategies) will have support at different times. In consequence, we have
randomized strategies that mix not only in terms of which sensor is utilized, but when
a sensor is utilized. At any given stage, most of these mixture coefficients will be 0,
and just a few pure strategies will be employed. The pure strategies that have support
will be a time-varying set determined by which sensors can see which objects at various
points over the planning horizon.
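To make the structure of the expanded master problem concrete, the sketch below sets up the LP of Eq. 4.12 - Eq. 4.14 with one resource constraint per sensor per stage for a small, fixed set of columns. The costs and resource-use numbers are invented for illustration; in the actual algorithm each column's entries come from solving the POMDP subproblems, and additional columns are priced in by the Column Generation routine.

import numpy as np
from scipy.optimize import linprog

# 4 hypothetical columns (pure strategies), S = 2 sensors, T = 2 stages.
# cost[j]      : expected classification cost of column j (Eq. 4.12)
# use[s, t, j] : expected resource use of column j on sensor s at stage t (Eq. 4.13)
cost = np.array([10.0, 7.5, 6.0, 9.0])
use = np.array([[[3.0, 1.0, 4.0, 0.0],
                 [1.0, 3.0, 2.0, 0.0]],
                [[0.0, 2.0, 3.0, 4.0],
                 [2.0, 1.0, 3.0, 1.0]]])
R = np.full((2, 2), 3.0)                        # R_s for each sensor at each stage

S, T, J = use.shape
A_ub = use.reshape(S * T, J)                    # one constraint per sensor per stage
b_ub = R.reshape(S * T)
A_eq = np.ones((1, J))                          # mixture weights form a distribution (Eq. 4.14)
b_eq = np.array([1.0])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.x, res.fun)                           # mixture weights q(gamma_i) and objective value

In this small instance the dual prices of the S·T resource rows play the role of the time-varying multipliers λs(t), and at most ST + 1 = 5 pure strategies receive non-zero weight, matching the basis size noted above.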
We have already implemented algorithms that handle a Markov Birth Process at each
location, and the preceding derivation provides the theoretical justification for modeling
not just object arrivals, but object departures. Our existing algorithm, with suitable
modifications, can be applied to solve this problem with reasonable computational time.
The question is just how the number of generated columns scales with the number of
constraints in the Column Generation routine; this is a matter for future research.
4.2 Time-varying Visibility
As a surrogate for “locality constraints” on sensing operations, in Ch. 3 we divided
locations into sets of groups, termed “visibility groups”, and gave sensors a 0–1 type
(static) constraint for which locations they could observe. In this section we consider a
model in which sensor-location visibility is time-varying but known and deterministic.
Consider a sensor management and object classification problem in which there are
a finite number of locations 1, . . . , N, each of which may have an object with a given
type or may be empty. In this formulation we assume that locations do not change their
class-affiliation, but may change visibility over time. The visibility of locations is taken
to be a known, boolean sequence of indicators that determine when locations can be
observed on a location-by-location basis; each location has its own visibility trajectory.
This model would be appropriate for example with a satellite that is scheduled to pass
over an area. Let there be S sensors, each of which has multiple sensor modes indexed
as m = 1, . . . , Ms, and assume that each sensor can observe a set of locations at each
discrete time-instant (stage) with a mode selected per location.
Let xi ∈ {0, 1, . . . , D} denote the state of location i, where xi = 0 if location i is
unoccupied, and otherwise xi = k with k > 0 indicates that location i contains an
object of type k. Let πi(0) ∈ ℜ^{D+1} be the initial, discrete probability distribution over the possible object types for the i-th location, for i = 1, . . . , N, where D ≥ 2. Assume
additionally that the random variables xi for i = 1, . . . , N are mutually independent.
Therefore the joint probability π(t) = P(x1 = k1, x2 = k2, . . . , xN = kN |I(t)) represent-
ing the collection of all information known about the N locations can be factored into
the product of N marginal, per-location, conditional distributions as π(t) = ∏_{i=1}^{N} πi(t).
Let there be a series of T discrete decision stages with t = 0, . . . , T −1 where sensors
can make measurements, and assume all locations must be classified at or before stage T
(terminal classification cost). Each sensor s has a limited set of locations that it can
observe at each stage, denoted by Os(t) ⊆ {1, . . . , N}. At each stage t, each sensor
can choose to employ one or more of its sensor modes to collect noisy measurements
concerning the states xi of sensed locations where i ∈ Os(t). We assume there are
no data association problems concerning where a sensor observation comes from, so
observables (measurements) can be assigned to the locations that generated them with
certainty. A sensor action by sensor s at stage t is the set of pairs:
us(t) = {(is(t), ms(t)) | is(t) ∈ Os(t), ms(t) ∈ Ms} (4.15)
where each pair consists of a location to observe is(t), and a sensor mode (independent
for each location) used to observe this location, ms(t), where the mode is restricted to
the set of feasible modes given the resource levels for each sensor. We assume that no two
sensors observe the same location at the same time in order to minimize the complexity
of the associated action and observation spaces. Let ui,s(t) refer to the sensor action
taken on location i with sensor s at stage t if any, or let ui,s(t) = ∅ otherwise.
Sensor measurements are modeled as belonging to a finite set ys,m ∈ {1, . . . , Ls}. The
likelihood of the measured value is assumed to depend on the sensor s, sensor mode m,
location i and on the true state at the location xi, but not on the states of other locations.
Sensor measurements of non-visible locations are constrained to be uncorrelated with the
states of those locations and therefore have no information. Denote the likelihood of a
sensor measurement of location i as P(ys,m(t)|xi, i, s, m). We assume that this likelihood
given xi is time-invariant, and that the random measurements ys,m(t) are conditionally
independent of other measurements yσ,n(τ) given the states xi, xj for all sensors s, σ
and modes m, n provided i = j or τ = t. A word is required concerning the causality
between sensing actions and sensor observations. For our purposes we are considering
that a sensor action at stage t generates a sensor measurement that is seen at stage t.
Stage t + 1 is the first time in which the information observed at stage t can be acted
upon.
Each sensor has a quantity Rs of resources available for measurements during each
decision stage. Associated with the use of mode m by sensor s on location i at time t
is a resource cost rs(ui,s(t)) to use this mode, representing power or some other type of
resource required to operate the sensor:
\sum_{i \in O_s(t)} r_s(u_{i,s}(t)) \;\le\; R_s \qquad \forall\, s \in [1 \ldots S];\ \forall\, t \in [0 \ldots T-1]   (4.16)
This is a hard constraint for each sample-path: each and every possible realization
of observations and actions. There are no sensor resource dynamics per se because
resource availability at a later stage does not depend on resource expenditures at earlier
stages. There is, however, a time-varying demand for sensing resources, since the visibility constraints dictate the best and worst times for observing locations.
These constraints model the limited information-processing bandwidth of sensors. The
S sensors are treated as a set of finite-capacity resource pools wherein making sensor
measurements consumes resources; however, resources are renewed at every new time-step. Aside from battery-powered devices, sensing capacity is a renewable resource.
Concerning the set Os(t) of visible locations at stage t, consider the following dy-
namics: location i has a deterministic visibility profile, wi,s(t), for each sensor s that
is known a priori: wi,s : ℕ^+ → {0, 1}, where the indicator ‘1’ indicates the location is
visible. In this case, we assume that a sensor can look at a series of locations in any
order so long as the ordering preserves visibility constraints, so Os(t) = {i | wi,s(t) =
1, i ∈ [1, . . . , N]} ∀ s, t. The sensor platforms themselves have no physical state repre-
sentation. In this context, the action space is the set of all feasible sensor modes and
feasible sensing locations (for all sensors) at a particular time.
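A small bookkeeping sketch of these visibility profiles (the profiles themselves are hypothetical values chosen for illustration):

import numpy as np

# w[s, i, t] = 1 if location i is visible to sensor s at stage t (known a priori)
N, S, T = 4, 2, 3
w = np.zeros((S, N, T), dtype=int)
w[0] = [[1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]   # e.g. a scheduled satellite pass
w[1] = [[0, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0]]

def visible_set(s, t):
    # O_s(t): the locations sensor s may observe at stage t
    return {i for i in range(N) if w[s, i, t] == 1}

print(visible_set(0, 1))                              # -> {0, 1}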
Let I(t) denote the sequence of past sensing actions and measurement outcomes up
to and including stage t − 1:
I(t) = {(ui,s(τ), ys,m(τ)) | i ∈ Os(τ); s = 1, . . . , S; τ = 0, . . . , t − 1}
Define I(0) as the prior probability π(0) = p(x) = ∏_{i=1}^{N} p(xi). Under the assumption
of conditional independence of measurements and independence of individual states at
each location, the joint probability π(t) = P(x1 = k1, x2 = k2, . . . , xN = kN |I(t))
can be factored as the product of belief-states (marginal conditional probabilities) for
each location. Denote the belief-state at location i as πi(t) = p(xi|I(t)). When a
sensor measurement is taken, the belief-state is updated according to Bayes’ Rule. A
measurement of location i with the sensor-mode combination ui,s(t) = (i, m) at stage t
that generates observable ys,m(t) updates the belief-vector as:
\pi_i(t+1) = \frac{\mathrm{diag}\{P(y_{s,m}(t) \mid x_i = j, i, s, m)\}\, \pi_i(t)}{\mathbf{1}^T \mathrm{diag}\{P(y_{s,m}(t) \mid x_i = j, i, s, m)\}\, \pi_i(t)}   (4.17)
where 1 is the D + 1 dimensional vector of all ones. Eq. 4.17 captures the relevant
information dynamics that SM controls through the choice of measurement actions.
The objective for this formulation is to classify, with minimum cost as measured by
FAs and MDs, the state of each location at the end of T stages:
J = \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} c(x_i, v_i) \right]   (4.18)
subject to Eq. 4.16.
After replacing the hard resource constraint in Eq. 4.16 with an expected-resource-
use constraint for each of the S sensors, we have the constraints:
\sum_{i \in O_s(t)} E[r_s(u_{i,s}(t))] \;\le\; R_s \qquad \forall\, s \in [1 \ldots S];\ \forall\, t \in [0 \ldots T-1]   (4.19)
We can dualize the resource constraints and create an augmented objective-function
(Lagrangian) of the form:
J_\lambda = \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} c(x_i, v_i) \;-\; \sum_{s=1}^{S} \sum_{t=0}^{T-1} \lambda_s(t) \left( R_s - \sum_{i \in O_s(t)} r_s(u_{i,s}(t)) \right) \right]   (4.20)
This problem is a lower bound on the original problem Eq. 4.18 with sample path
constraints Eq. 4.16 because every strategy that satisfies the original sample path con-
strained problem is feasible for the relaxed problem.
Using Lemma 4.1.1 again, the cost function can be decoupled into the sum of N local
cost functions as follows:
J_\lambda = \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} c(x_i, v_i) + \sum_{s=1}^{S} \sum_{t=0}^{T-1} \sum_{i \in O_s(t)} \lambda_s(t)\, r_s(u_{i,s}(t)) - \sum_{s=1}^{S} \sum_{t=0}^{T-1} \lambda_s(t) R_s \right]

J_\lambda = \min_{\gamma \in \Gamma} E^{\gamma}\left[ \sum_{i=1}^{N} \left( c(x_i, v_i) + \sum_{t=0}^{T-1} \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t))\, I(i \in O_s(t)) \right) \right] - \sum_{t=0}^{T-1} \sum_{s=1}^{S} \lambda_s(t) R_s
where I(·) is an indicator function (meaning is context-sensitive).
J_\lambda = \sum_{i=1}^{N} \min_{\gamma_i \in \Gamma_L} E^{\gamma_i}\left[ c(x_i, v_i) + \sum_{t=0}^{T-1} \sum_{s=1}^{S} \lambda_s(t)\, r_s(u_{i,s}(t))\, I(i \in O_s(t)) \right] - \sum_{t=0}^{T-1} \sum_{s=1}^{S} \lambda_s(t) R_s   (4.21)
From here on we can see that this problem variation with time-varying but known
object visibility is a straightforward extension of our existing formulation discussed in
Ch. 3. A lower bound follows that is analogous to that of Eq. 4.12 - Eq. 4.14. The only
difference in the time-varying visibility case is that there is a time-varying set of feasible
actions for each subproblem such that for stages when a subproblem is not visible the
set of available actions (for the subproblem in question) is constrained to the empty
set (waiting). We can again resort to the same Column Generation techniques that
were developed in the previous section as an extension to the algorithm described in
Section 2.2.1.
4.3 Summary
In conclusion, we have developed 2 new lower bounds that apply to SM problems that
involve time-varying (HMM) states or known, time-varying location visibility (e.g. for
application to satellite sensing operations). These lower bounds are useful for planning
with RH control techniques. We have described how the Column Generation algorithm
of Section 2.2.1 can be extended to these more general problem formulations, and we have
conducted simulations along these lines for the case where location states use Markov
Birth Process dynamics (as mentioned in Fig. 2·7, Fig. 2·8 and Fig. 2·6). We expect that
the same type of RH control algorithms described in Ch. 3 using Column Generation
with mixtures of pure strategies over sensors and over time (i.e. time-varying Lagrange
multipliers) will give near-optimal performance in a computationally tractable amount
of time.
Chapter 5
Adaptive Sensing with Continuous Action
and Measurement Spaces
In this chapter, we present an alternative approach for the inhomogeneous adaptive
search problems studied in [Bashan et al., 2007,Bashan et al., 2008] based on the con-
straint relaxation approach developed in [Casta˜n´on, 1997, Casta˜n´on, 2005b, Hitchings
and Casta˜n´on, 2010]. Our approach decomposes the global two-stage SM problem into
per location two-stage SM problems, coupled through a master problem for partition-
ing sensing resources across subproblems, which is solved using Lagrangian Relaxation.
The resulting algorithm obtains solutions of comparable quality to the grid search ap-
proaches proposed in [Bashan et al., 2007,Bashan et al., 2008] with roughly two orders
of magnitude less computation.
Whereas in Ch. 3 the computation went towards considering a relatively narrow
set of actions with a handful of possible observations over a long (deep) horizon, this
chapter considers a continuum of actions and observations with a shallower horizon.
In this chapter, the states of locations are binary-valued: X = {‘empty’, ‘occupied’}.
Therefore this chapter basically implements the ‘search’ mode of Ch. 3 for the shortest
horizon considered in that chapter but with a much higher resolution sensor model.
The remainder of this chapter proceeds as follows: Section 5.1 describes the two-stage
adaptive SM problem of [Bashan et al., 2007,Bashan et al., 2008]. Section 5.2 describes
our solution to obtain adaptive sensor allocation strategies. Section 5.3 discusses an
alternative approach based on DP techniques. In Section 5.4, we provide simulation
results that compare our algorithms with the algorithms in [Bashan et al., 2008].
5.1 Problem Formulation
Consider an area that contains Q locations (or ‘cells’) to be measured with R total units
of resources (e.g. energy) over T = 2 stages indexed by t, starting with t = 0. Let the
cells be indexed by k ∈ [1, . . . , Q]. Define the indicator function Ik = 1 if the k-th cell
contains an object, and let Ik = 0 otherwise. The variables Ik are assumed to be random
and independent, with prior probability πk0 = Pr(Ik = 1).
The decision variables at stages t = 0, 1, are denoted by xkt, k = 1, . . . , Q, corre-
sponding to the sensor resource allocated to each cell k at stage t. As in [Bashan et al.,
2007,Bashan et al., 2008], allocating resources to a cell corresponds to allocating radar
energy to that cell, which improves the signal-to-noise ratio (SNR) of a measurement
there. A measurement generated for each cell k at stages 0, 1, is described by:
Y_{kt} = \sqrt{x_{kt}}\, I_k + v_k(t)   (5.1)
where vk(t) are Gaussian, zero-mean, unit variance random variables that are mutually
independent across k and t and independent of Ii for all i. Thus, xkt represents the energy
allocated to cell k at stage t and is also the signal-to-noise ratio for the measurement.
See Fig. 5·1 for a graphical depiction of this sensor model.
Let Yt = [Y1t, Y2t, . . . , YQt] denote the measurements collected across all locations at
stage t. For adaptive sensing, we allow the allocations xk1 to be a function of the ob-
servations Y0. Thus, an adaptive sensor allocation is a strategy γ = {(xk0, xk1(Y0)), k =
1, . . . , Q}. The resource (energy) constraint requires that feasible adaptive sensor allo-
Figure 5·1: Depiction of the measurement likelihoods for empty and non-empty cells as a function of xk0, showing p(Yk0 | Ik = 1; xk0) for xk0 = [0.81 2.83 5.66] and p(Yk0 | Ik = 0). √xk0 gives the mean of the density p(Yk0 | Ik = 1). If the cell is empty the observation always has mean 0 (black curve).
cations satisfy:
\sum_{k=1}^{Q} \left( x_{k0} + x_{k1}(Y_0) \right) \;\le\; R \quad \text{for all } Y_0   (5.2)
Note, the resulting optimization problem has a continuum of constraints for all sample
paths of the observations Y0.
We initially use the following cost function, from [Bashan et al., 2007,Bashan et al.,
2008], to develop an adaptive sensing strategy:
J^{\gamma} = E^{\gamma}\left[ \sum_{k=1}^{Q} \frac{I_k}{x_{k0} + x_{k1}(Y_0)} \right]   (5.3)
This cost function rewards allocating more sensing resources to cells that are occupied.
There are many other variations that can be considered, but this simple form is sufficient
to illustrate our approach.
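For reference, a minimal simulation sketch of the measurement model in Eq. 5.1 and one sample of the sum inside Eq. 5.3 (the priors and energy allocations below are hypothetical values, not the ones used in Section 5.4):

import numpy as np

rng = np.random.default_rng(0)
Q = 5
pi0 = np.array([0.10, 0.20, 0.15, 0.30, 0.05])   # hypothetical priors pi_k0
x0 = np.array([2.0, 3.0, 2.5, 4.0, 1.0])         # hypothetical stage 0 energies x_k0
x1 = np.full(Q, 1.5)                             # a fixed stage 1 allocation, for illustration

I = (rng.random(Q) < pi0).astype(float)          # occupancy indicators I_k
Y0 = np.sqrt(x0) * I + rng.standard_normal(Q)    # Eq. 5.1 at stage 0
sample_cost = np.sum(I / (x0 + x1))              # one sample of the sum inside Eq. 5.3
print(Y0, sample_cost)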
The above problem can be solved in principle using SDP, but the cost is not easily
separable across stages. Nevertheless, one can still use nested expectation principles to
Figure 5·2: Waterfall plot of joint probability p(Yk0|Ik; xk0) for πk0 = 0.50
for xk0 ∈ [0 . . . 20]. This figure shows the increased discrimination ability that
results from using higher-energy measurements (separation of the peaks).
approach this problem, as outlined in [Bashan et al., 2008]. Define the belief-state:
πk,t+1 = P(Ik = 1|Yt) = P(Ik = 1|Ykt), k = 1, . . . , Q (5.4)
and resource dynamics:
R_1 = R - \sum_{k=1}^{Q} x_{k0}
where the last equality in Eq. 5.4 follows from the independence assumptions made on
the cell indicator functions and the measurement noise. Fig. 5·2 is a 3D plot displaying
how increasing energy expenditures improves the discriminatory power of sensor mea-
surements (improves SNR), and Fig. 5·3 demonstrates how this increased discriminatory
power affects the resulting a posteriori probability for a cell, πk1.
Figure 5·3: Graphic showing the posterior probability πk1 as a function of
the initial action xk0 and the initial measurement value Yk0. This surface plot
is for λ = 0.01 and πk0 = 0.20. (The boundary between the high (red) and low
(blue) portions of this surface is not straight but curves towards -y with +x.)
Let πt = [π1t, . . . , πQt], x0 = [x10, . . . , xQ0], x1 = [x11, . . . , xQ1]. Then:
\min_{\gamma} J^{\gamma} = \min_{\gamma} E^{\gamma}\left[ \sum_{k=1}^{Q} \frac{I_k}{x_{k0} + x_{k1}(Y_0)} \right]
  = \min_{x_0} E\left[ \min_{x_1(Y_0)} E\left[ \sum_{k=1}^{Q} \frac{I_k}{x_{k0} + x_{k1}(Y_0)} \,\Big|\, Y_0 \right] \right]
  = \min_{x_0} E\left[ E\left[ \min_{x_1} \sum_{k=1}^{Q} \frac{I_k}{x_{k0} + x_{k1}(Y_0)} \,\Big|\, Y_0 \right] \right]
  = \min_{x_0} E\left[ \min_{x_1} \sum_{k=1}^{Q} \frac{\pi_{k1}}{x_{k0} + x_{k1}} \right]   (5.5)
For each value of x0 and measurements Y0, the inner minimization is a deterministic
resource allocation problem with a convex, separable objective and a single constraint
\sum_{k=1}^{Q} x_{k1} = R - \sum_{k=1}^{Q} x_{k0}, which is straightforward to solve. The algorithm of [Bashan et al., 2008] then solves the overall problem by enumerating possible values of x0, simulating values of Y0 for each x0 and subsequently solving the inner minimization problem
for each x0, Y0 combination to find the best set of adaptive sensor allocations.
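For concreteness, the following sketch solves that inner problem (minimize Σ_k πk1/(xk0 + xk1) over xk1 ≥ 0 subject to Σ_k xk1 = R1) with the standard KKT water-filling argument and a bisection on the multiplier; the posteriors, stage 0 allocations and R1 below are hypothetical values:

import numpy as np

def stage1_allocation(pi1, x0, R1, iters=100):
    # KKT water-filling: x_k1 = max(0, sqrt(pi_k1 / mu) - x_k0), with the multiplier mu
    # chosen (by geometric bisection) so that the allocations sum to R1.
    def total(mu):
        return np.sum(np.maximum(0.0, np.sqrt(pi1 / mu) - x0))
    lo, hi = 1e-12, 1e12
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if total(mid) > R1:
            lo = mid                             # handing out too much energy -> raise mu
        else:
            hi = mid
    mu = np.sqrt(lo * hi)
    return np.maximum(0.0, np.sqrt(pi1 / mu) - x0)

pi1 = np.array([0.05, 0.40, 0.75, 0.10])         # hypothetical posteriors pi_k1
x0 = np.array([2.0, 2.0, 2.0, 2.0])              # hypothetical stage 0 allocations
x1 = stage1_allocation(pi1, x0, R1=6.0)
print(x1, x1.sum())                              # allocations sum (approximately) to R1

This is the same deterministic, separable convex problem that the replanning step at the end of Section 5.2 solves once the observations Y0 have been collected.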
5.2 Relaxed Solution
In order to avoid enumeration of all feasible resource allocations at stage 0, we adopt a
constraint relaxation approach that expands the space of feasible sensor allocations, as
in Section 2.2 and [Casta˜n´on, 1997,Casta˜n´on, 2005b]. Specifically, we approximate the
constraints implied by Eq. 5.2 with a single resource constraint that constrains the total
expected energy use:
E\left[ \sum_{k=1}^{Q} \left( x_{k0} + x_{k1}(Y_0) \right) \right] \;\le\; R, \qquad x_{kt} \ge 0   (5.6)
This expands the space of admissible sensor allocations, so the solution of the SM prob-
lem with these constraints yields a lower bound to the original problem. We introduce
a Lagrange multiplier λ ≥ 0 to integrate this constraint into an augmented objective
function:
J^{\gamma}(\lambda) = \sum_{k=1}^{Q} E\left[ \frac{I_k}{x_{k0} + x_{k1}(Y_0)} \right] + \lambda \left( E\left[ \sum_{k=1}^{Q} \left( x_{k0} + x_{k1}(Y_0) \right) \right] - R \right)   (5.7)
Given λ, Eq. 5.7 is additive over the cells, so for the remainder of this chapter we
concentrate on the cost of optimally classifying just one cell. This separability establishes
the following result:
Theorem 5.2.1. An optimal solution of Eq. 5.7 subject to constraints Eq. 5.6 is achieved
by an adaptive sensing strategy where xk1(Y0) ≡ xk1(Yk0) ≡ xk1(πk1).
The proof of this follows along the lines of the results in [Casta˜n´on, 2005b,Castanon
and Wohletz, 2009]. This result restricts the adaptive sensing allocations for each cell k
to depend only on the information collected on that cell, summarized by πk1, and leads
to a decomposition of the optimization problem over each cell k for each value of λ ≥ 0.
We analyze this single cell problem next.
Consider the expected cost in cell k, ignoring the term λR that is independent of
xk0, xk1, which is:
\min_{x_{k0}, x_{k1}} J^{\lambda}_k = \min_{x_{k0}} E\left[ \min_{x_{k1}} E\left[ \frac{I_k}{x_{k0} + x_{k1}(Y_0)} + \lambda \left( x_{k0} + x_{k1}(Y_0) \right) \,\Big|\, Y_0 \right] \right]   (5.8)
After computing expectations, the inner minimization becomes:
J^{\lambda*}_{k,1}(x_{k0}, \pi_{k1}) = \min_{x_{k1}} \left[ \frac{\pi_{k1}}{x_{k0} + x_{k1}} + \lambda \left( x_{k0} + x_{k1} \right) \right]   (5.9)
The optimal allocation xk1 can be found through differentiation as:
x_{k1} = \begin{cases} \sqrt{\pi_{k1}/\lambda} - x_{k0}, & \text{if } x_{k0}^2 < \pi_{k1}/\lambda \\ 0, & \text{otherwise} \end{cases}   (5.10)
The optimal adaptive strategy at stage 1 has two regions: a region where it is not
worth allocating additional resources in stage 1 to this cell because πk1 is too small and
a region where the cell will receive resources in stage 1. Associated with these regions
is an optimal cost J^{λ∗}_{k,1}(xk0, πk1) given by:

J^{\lambda*}_{k,1}(x_{k0}, \pi_{k1}) = \begin{cases} 2\sqrt{\lambda \pi_{k1}}, & \text{if } x_{k0}^2 < \pi_{k1}/\lambda \\ \dfrac{\pi_{k1}}{x_{k0}} + \lambda x_{k0}, & \text{otherwise} \end{cases}   (5.11)
So in general, the overall inner minimization at time 1 has the form:
J^{\lambda*}_{k,1}(x_{k0}, \pi_{k1}) = \left( \frac{\pi_{k1}}{x_{k0}} + \lambda x_{k0} \right) I\!\left( x_{k0}^2 \ge \frac{\pi_{k1}}{\lambda} \right) + 2\sqrt{\lambda \pi_{k1}}\; I\!\left( x_{k0}^2 < \frac{\pi_{k1}}{\lambda} \right)   (5.12)
Recall that πk1 = p(Ik = 1|Yk0; xk0), so it is computed from Bayes’ rule as:
\pi_{k1} = \frac{p(Y_{k0} \mid I_k = 1; x_{k0})\, \pi_{k0}}{p(Y_{k0} \mid I_k = 1; x_{k0})\, \pi_{k0} + p(Y_{k0} \mid I_k = 0; x_{k0})(1 - \pi_{k0})}   (5.13)
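Putting Eq. 5.10 - Eq. 5.13 together for a single cell, a minimal sketch (the parameter values are hypothetical, chosen only for illustration):

import numpy as np
from scipy.stats import norm

def posterior(y0, x0, pi0):
    # Eq. 5.13: P(I_k = 1 | Y_k0 = y0) under the Gaussian sensor model of Eq. 5.1
    l1 = norm.pdf(y0, loc=np.sqrt(x0))
    l0 = norm.pdf(y0, loc=0.0)
    return l1 * pi0 / (l1 * pi0 + l0 * (1.0 - pi0))

def stage1_policy(y0, x0, pi0, lam):
    # Eq. 5.10 (follow-up energy) and Eq. 5.11 (resulting cost J^{lambda*}_{k,1})
    pi1 = posterior(y0, x0, pi0)
    if x0 ** 2 < pi1 / lam:                      # worth a second measurement
        return np.sqrt(pi1 / lam) - x0, 2.0 * np.sqrt(lam * pi1)
    return 0.0, pi1 / x0 + lam * x0              # stop after one measurement

print(stage1_policy(y0=1.2, x0=2.0, pi0=0.2, lam=0.01))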
We now analyze the boundary defined by the constraint in Eq. 5.11 that determines
whether or not one measurement should be made or two and thus which of the two
mutually-exclusive terms in the cost function will be active. Starting from Eq. 5.11 and
using Bayes’ rule, we define the set Y (xk0, λ) as all yk0 (for a given λ) such that:
x_{k0}^2 < \frac{1}{\lambda} \cdot \frac{N(y_{k0}; \sqrt{x_{k0}}, 1)\, \pi_{k0}}{N(y_{k0}; \sqrt{x_{k0}}, 1)\, \pi_{k0} + N(y_{k0}; 0, 1)(1 - \pi_{k0})}
Note that Y (xk0, λ) is a monotone increasing set as πk0 increases, and is monotone
decreasing in λ. After some simplifications, this set is equivalent to all Yk0 that satisfy
the following inequality:
2 \log x_{k0} < -\log \lambda + \sqrt{x_{k0}}\, Y_{k0} - \frac{x_{k0}}{2} - \log\!\left[ \exp\!\left( -\tfrac{1}{2}\left( x_{k0} - 2\sqrt{x_{k0}}\, Y_{k0} \right) \right) + \frac{1 - \pi_{k0}}{\pi_{k0}} \right]   (5.14)
When this inequality is true, then Yk0 ∈ Y (xk0, λ) and xk1 > 0, otherwise xk1 = 0.
Fig. 5·4 illustrates this boundary.
In Fig. 5·5, the boundary that defines Y (xk0, λ) as a function of xk0 and Yk0 for an
arbitrary value of λ is shown for illustrative purposes. This surface is basically the log of
the posterior probability πk1 shown in Fig. 5·3 with a vertical offset given by −2 log(xk0).
Fig. 5·6 gives 3 cross-sections through the boundary/surface of Fig. 5·5.
Fig. 5·7 is a two-factor exploration of the parameter-space (λ,πk0) that we used in
order to investigate the structure of this boundary. Examining the monotonicity of the
boundary is important in order to determine that our algorithm is well behaved, i.e.
that it will always converge on the optimal choice of Lagrange multiplier that decouples
Figure 5·4: Cost function boundary (see Eq. 5.14) with λ = 0.011 and
πk0 = 0.18. In the lighter region two measurements are made, in the darker
region just one. (Note positive y is downwards.)
Figure 5·5: The optimal boundary for taking one action or two as a function
of (xk0, Yk0) (for the [Bashan et al., 2008] cost function) for λ = 0.01 and
πk0 = 0.20. The curves in Fig. 5·6 represent cross-sections through this surface
for the 3 x-values referred to in that figure.
Figure 5·6: This figure gives another depiction of the optimal boundary between taking one measurement action or two for the [Bashan et al., 2008] cost function, plotted against yk0 for xk0 = [0.81 2.83 5.66], λ = 0.01 and πk0 = 0.20. For all Y(xk0, λ) ≥ 0 two measurements are made (and the highest curve is for the smallest xk0; see Fig. 5·5 for the 3D surface from which these cross-sections were taken).
the individual subproblems.
We now develop an expression for the outer expectation in Eq. 5.8, which defines the
problem of optimizing xk0, as follows: let p(yk0; xk0) = p(Yk0|Ik = 1; xk0)πk0 +p(Yk0|Ik =
0; xk0)(1 − πk0). Then using Eq. 5.13 yields:
E\!\left[ J^{\lambda*}_{k,1}(x_{k0}, \pi_{k1}) \right] = \int_{y \in Y(x_{k0},\lambda)} 2\sqrt{\lambda \pi_{k1}}\; p(y_{k0}; x_{k0})\, dy_{k0} + \int_{y \notin Y(x_{k0},\lambda)} \left( \frac{\pi_{k1}}{x_{k0}} + \lambda x_{k0} \right) p(y_{k0}; x_{k0})\, dy_{k0}

  = \int_{y \in Y(x_{k0},\lambda)} 2\sqrt{\lambda\, \pi_{k0}\, N(y_{k0}; \sqrt{x_{k0}}, 1)\, p(y_{k0}; x_{k0})}\; dy_{k0} + \int_{y \notin Y(x_{k0},\lambda)} \left( \frac{\pi_{k0}}{x_{k0}}\, N(y_{k0}; \sqrt{x_{k0}}, 1) + \lambda x_{k0}\, p(y_{k0}; x_{k0}) \right) dy_{k0}
To minimize E[J^{λ∗}_{k,1}(xk0, πk1)] with respect to xk0, we have to evaluate the integrals above. Note that the regions of integration also depend on xk0. Although it is possible
Figure 5·7: Two-factor exploration to determine how the optimal boundary
between taking one measurement or two measurements varies for a cell with
the parameters (p, λ) where p = πk0 (for the [Bashan et al., 2008] problem cost
function). Two measurements are taken in the darker region, one measurement
for the lighter region.
to take derivatives and try to solve for possible minima, it is simpler to evaluate the
integrals by numerical quadrature and obtain the optimal value xk0 by numerical search
in one dimension.
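A condensed sketch of that quadrature and one-dimensional search (the grids and parameter values are illustrative, not the ones used in the experiments of Section 5.4):

import numpy as np
from scipy.stats import norm

def expected_stage1_cost(x0, pi0, lam, y_grid):
    # Quadrature approximation of E[J^{lambda*}_{k,1}(x_k0, pi_k1)] over Y_k0 (Eq. 5.12)
    l1 = norm.pdf(y_grid, loc=np.sqrt(x0))
    l0 = norm.pdf(y_grid)
    p_y = l1 * pi0 + l0 * (1.0 - pi0)            # marginal p(y_k0; x_k0)
    pi1 = l1 * pi0 / p_y                         # Eq. 5.13
    two = x0 ** 2 < pi1 / lam                    # region Y(x_k0, lambda)
    cost = np.where(two, 2.0 * np.sqrt(lam * pi1), pi1 / x0 + lam * x0)
    return np.sum(cost * p_y) * (y_grid[1] - y_grid[0])

pi0, lam = 0.2, 0.01                             # hypothetical values
y_grid = np.linspace(-8.0, 12.0, 1000)
x_grid = np.linspace(0.2, 20.0, 100)             # candidate x_k0 values
costs = [expected_stage1_cost(x0, pi0, lam, y_grid) for x0 in x_grid]
print(x_grid[int(np.argmin(costs))])             # the quadrature-optimal x_k0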
The above procedure computes the optimal strategy for the relaxed problem of
Eq. 5.7 for fixed λ. To obtain a solution to the original problem with the relaxed
constraints of Eq. 5.6, we do an optimization over λ: a one-dimensional concave max-
imization problem where the direction of ascent is readily identified. A subgradient
direction is given by:
\partial J^{\gamma}(\lambda) = E\left[ \sum_{k=1}^{Q} \left( x_{k0} + x_{k1}(Y_0) \right) \right] - R
  = \sum_{k=1}^{Q} \int_{y \in Y(x_{k0},\lambda)} \sqrt{\frac{\pi_{k1}}{\lambda}}\; p(y_{k0}; x_{k0})\, dy_{k0} + \sum_{k=1}^{Q} \int_{y \notin Y(x_{k0},\lambda)} x_{k0}\; p(y_{k0}; x_{k0})\, dy_{k0} - R
Note, the relevant expectations have already been evaluated by quadrature in the search
for the optimal xk0 for each λ, which makes searching for the optimal λ straightforward.
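The resulting one-dimensional search over λ can be sketched as a bisection on the sign of the subgradient; the function expected_use below is a stand-in for the per-cell quadrature described above (replaced here by a hypothetical, monotone decreasing model so that the sketch is self-contained):

import numpy as np

def find_lambda(expected_use, R, lam_lo=1e-6, lam_hi=10.0, iters=60):
    # Expected energy use is monotonically decreasing in lambda, and the subgradient
    # E[sum_k (x_k0 + x_k1)] - R changes sign at the optimal multiplier.
    for _ in range(iters):
        lam = np.sqrt(lam_lo * lam_hi)
        if expected_use(lam) > R:
            lam_lo = lam                         # using too much energy -> make it dearer
        else:
            lam_hi = lam
    return np.sqrt(lam_lo * lam_hi)

# Hypothetical stand-in for the per-cell quadrature above (monotone decreasing in lambda)
toy_expected_use = lambda lam: 100.0 / np.sqrt(lam)
print(find_lambda(toy_expected_use, R=500.0))    # -> about (100 / 500)**2 = 0.04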
The algorithm above obtains optimal adaptive sensor allocations for optimizing
Eq. 5.3 subject to the relaxed constraints in Eq. 5.6. To obtain sensor allocations
that satisfy the original constraints of Eq. 5.2, we use the sensor allocations {xk0, k =
1, . . . , Q} determined by our procedure above, collect the vector of observations Y0 across
all cells, and then replan for the optimal allocation of the remaining resources, enforcing
the constraint Eq. 5.2 for the specific observations Y0. This stage 1 optimization problem
is a straightforward deterministic separable convex minimization problem with a single
additive constraint, and can be solved analytically in finite time by indexing the cells,
as described in [Bashan et al., 2008].
5.3 Bayesian Objective Formulation
One of the factors that makes the adaptive SM problem in [Bashan et al., 2008] complex
is that the objective function Eq. 5.3 does not depend on the observed measurements at
stage 1, Y1. As argued in [Bashan et al., 2008], this objective is related to lower bounds
in performance such as Cramer Rao bounds on the estimation of Ik or Chernoff bounds
on correct classification, particularly for open-loop allocations. The resulting cost is not
separable across stages, and is a hybrid of bounds and actual expected performance.
A direct approach would have been to define a cost at stage 2 that depends on the
measurements of both stages 0 and 1, along with a decision stage that generates an
estimate or a classification for each cell, as in [Casta˜n´on, 2005b], and to use DP tech-
niques for partially observed Markov decision processes (POMDPs), suitably extended
to incorporate continuous-valued action and measurement spaces.
We assume that for each cell k at stage 2 there will be a classification decision uk2 ∈
{0, 1}, which depends on the observed measurements Y0, Y1. For any cell k where we have
collected Yk0, Yk1, denote the conditional probability πk2 = P(Ik = 1|Yk0, Yk1; xk0, xk1).
Let MD denote a constant representing the relative cost of a missed detection at the
end of 2 stages where the false alarm cost (FA) is held constant at 1. The Bayes’ cost
we seek to optimize is:
J^{Bayes} = E\left[ \sum_{k=1}^{Q} \left( MD\; I_k (1 - u_{k2}) + (1 - I_k)\, u_{k2} \right) \right]
The surfaces of Fig. 5·8 show how the MD and FA costs vary with the values of xk1
and Yk1 after making the first measurement. These figures are cost-samples for the cost
that results after each outcome (xk1,Yk1) at the last stage. The FA classification cost
is independent of the amount of energy used in the second measurement, but the MD
classification cost is not.
Figure 5·8: Plot of cost function samples associated with false alarms, missed
detections and the optimal choice between false alarms and missed detections
(for the Bayes’ cost function).
Incorporating the relaxed constraints of Eq. 5.6, we get an augmented cost that can
be decomposed over cells, along the lines of the formulation in the previous section, to
obtain the cell-wise optimization objective:
J^{Bayes}_k(\lambda) = E\left[ \left( MD\; I_k (1 - u_{k2}) + (1 - I_k)\, u_{k2} \right) + \lambda \left( x_{k0} + x_{k1} \right) \right]
where xk1 depends on Yk0, and uk2 depends on Yk0, Yk1. Using DP, the sufficient statistic
for this problem is the information state πkt for each stage t. The cost-to-go at stage 2
is:
J^*_{k,2}(\pi_{k2}) = \min\left\{ MD\; \pi_{k2},\ (1 - \pi_{k2}) \right\}
The DP recursion is now:
J^{\lambda*}_{k,1}(\pi_{k1}) = \min_{x_{k1}} E\left[ \min\left\{ MD\, \frac{p(Y_{k1} \mid 1; x_{k1})\, \pi_{k1}}{p(Y_{k1}; x_{k1})},\ \left( 1 - \frac{p(Y_{k1} \mid 1; x_{k1})\, \pi_{k1}}{p(Y_{k1}; x_{k1})} \right) \right\} + \lambda x_{k1} \right]   (5.15)
Using this DP recursion yields the optimization problem for xk0:
\min_{x_{k0}} E\left[ J^{\lambda*}_{k,1}(\pi_{k1}) \right] + \lambda x_{k0}
Fig. 5·9 shows a set of plots for the Bayes’ cost function when t = 1 for the proba-
bilities πk1 = 0.0991 and πk1 = 0.3493 for λ = 0.0001. The first row in the figure
shows cost-samples for the unaugmented cost function, the middle row for Eq. 5.15 and
the final row shows the corresponding joint-probabilities for each prior. These figures
demonstrate that the higher the value of πk1, the larger the amount of cost associated
with a measurement Yk1 that is near 0 (the measurement is ambiguous and so the chance
of making a mistake and paying a classification cost is relatively large). Fig. 5·10 de-
scribes the boundary between determining a cell to be empty / occupied for the Bayes’
cost function with γ(xk1) defined as the observation value Yk1 that makes the FA and
MD costs equal.
Figure 5·9: This figure shows cost-to-go function samples as a function of
the second sensing-action xk1 and the second measurement Yk1 for the Bayes’
cost function. These plots use 1000 samples for Yk1 and 100 for xk1.
Figure 5·10: Threshold function for declaring a cell empty (risk of MD) or
occupied (risk of FA).
From an algorithmic perspective, we compute J^{λ∗}_{k,1}(πk1) for a discrete set of points on
the unit interval. For each point πk1, a discrete set of observations and a discretized set of
allocations xk1 are evaluated to compute the resulting conditional probabilities πk2, and
a cost-to-go value J∗
k,2(πk2) is obtained by interpolation. Summing over measurements
for each xk1 and multiplying by a scale factor yields expectations (quadrature), and
minimizing over xk1 leads to the cost-to-go value at πk1. This procedure is closely
related to the PBVI algorithm, see Appendix A.2. A similar procedure takes place at
stage 0, except that only one belief-point for πk0 needs to be considered.
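A condensed sketch of one such point-based backup (linear interpolation is used here in place of the cubic splines used in the experiments, and the grids, MD and λ values are illustrative):

import numpy as np
from scipy.stats import norm

MD, lam = 1.0, 1e-4                              # illustrative values
belief_grid = np.linspace(0.0, 1.0, 201)         # grid of pi_k1 points
J2 = np.minimum(MD * belief_grid, 1.0 - belief_grid)   # terminal cost-to-go J*_{k,2}

y_grid = np.linspace(-8.0, 12.0, 400)
dy = y_grid[1] - y_grid[0]
x1_grid = np.linspace(0.0, 20.0, 80)             # discretized stage 1 energies

def backup(pi1):
    # One point-based backup of Eq. 5.15 at belief pi1, by quadrature over Y_k1
    best = np.inf
    for x1 in x1_grid:
        l1 = norm.pdf(y_grid, loc=np.sqrt(x1))
        l0 = norm.pdf(y_grid)
        p_y = l1 * pi1 + l0 * (1.0 - pi1)
        pi2 = l1 * pi1 / p_y                     # posterior after the second measurement
        val = np.sum(np.interp(pi2, belief_grid, J2) * p_y) * dy + lam * x1
        best = min(best, val)
    return best

J1 = np.array([backup(p) for p in belief_grid])  # cost-to-go at stage 1 for every belief-point
print(J1[::50])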
The above algorithm depends on the value of λ. We find the optimal λ with a
search identical to the one in the previous section. The approach described above has
several major advantages: first, the cost is additive over stages, and thus allows direct
application of DP techniques. Second, the separability and shift-invariance of the costs
allows for computation of a single cost-to-go function J^{λ∗}_{k,1}(πk1) for all cells k, which is a
significant savings in computation.
5.4 Experiments
We now consider some simulation results. Simulations were done using MATLAB on
a 2.2 GHz, single-core, Intel P4, Linux machine. We ran two separate simulation con-
figurations. The first was an independent experiment and the second was meant to be
compared against the algorithm of [Bashan et al., 2007,Bashan et al., 2008].
In the first configuration, we used 1000 points to discretize the observation space
for quadrature, 100 discrete points to search for optimal sensor allocations xk0 with a
line search and 500 units of energy, which gives an SNR (= 10 log10(R/Q)) of 6.99 dB.
The values reported in the experiments are averages over 100 simulation runs using
Q = 100 cells. A set of prior probabilities was created for the 100 simulations using
100 independent samples from a gamma distribution with shape 2 and scale 3. The
values were scaled and thresholded to fall in the range [0.05 . . . 0.80]. The net result was
a vector of prior probabilities with most of its values around 0.10 − 0.20 and with a few
higher probability elements.
We first focus on comparing the adaptive sensor allocation algorithm using the re-
laxation approach of Section 5.2. Fig. 5·11 displays the initial resource allocation as a
function of the value of the prior probability of each cell πk0. The amount of resource
initially allocated to measuring a cell is a monotonically increasing function of the chance
that an object is there, as the cost function rewards spending resources on cells that
contain objects. Fig. 5·12 shows a similar behavior for the total expected resource alloca-
tions per cell (with the expected value of the follow-up resource allocations (xk1) making
up the difference between Fig. 5·11 and Fig. 5·12). The striations seen in Fig. 5·11 are
artifacts of the resource allocation quantizations at stage 0, which had a granularity of
around 0.2 units of energy. There are more points on these graphs for small values of πk0
because a prior was used with a relatively low probability of object occurrence. Fig. 5·13
is very similar to Fig. 5·12 because the classification cost is completely determined by
the total amount of resource spent and the prior probability of an object being present.
In terms of computation time, determining the optimal adaptive sensor allocation (for
Figure 5·11: The stage 0 resource allocation (energy) as a function of the prior probability π(0). The striations are an artifact of the discretization of resources when looking for the optimal xk0.
the case of non-uniform priors) required around two minutes for our MATLAB imple-
mentation. Subsequent evaluation of the performance using Monte Carlo runs required
around 10 seconds per simulation. To summarize the performance statistics, the aver-
age simulation cost over 100 simulations was 2.85 units, and the sample variance of the
simulation cost was 0.40 units.
In order to compare our results with the results of [Bashan et al., 2008] for the second
simulation configuration, we obtained a MATLAB version of their algorithm with a
Figure 5·12: Total resource allocation (energy) to a cell as a function of the prior probability π(0). The point-wise sums of the stage 0 and stage 1 resource expenditures are displayed here.
specific problem implementation that contained a grid of Q = 1024 cells and used an
action space discretized to 100 energy levels. Their algorithm evaluated expectations
using 500 random samples for Y0. We used 1000 energy levels and an observation space
discretized to 100 observations. The algorithms in [Bashan et al., 2008] assume a uniform
prior probability πk0 = 0.01, which reduces the need for a Q-dimensional enumeration
to a one dimensional enumeration because of symmetry, so xk0 is constant across all
cells k. To get an overall SNR of 10, the resource (energy) level R was set to 10240
so that, on average, there are 10 units of resources per cell. We used 100 Monte Carlo
simulations to evaluate the performance of the adaptive sensor allocations.
With this second configuration involving an order of magnitude more cells, our MAT-
LAB algorithm created an adaptive sensor allocation plan in 9.5 seconds by exploiting
the symmetry across cells. Evaluation of performance with the 100 Monte Carlo simu-
Figure 5·13: Cost associated with a cell as a function of the prior probability π(0). For the optimal resource allocations, there is a one-to-one correspondence between the cost of a cell and the resource utilized to sense a cell.
lations required an additional 17.7 seconds to do the 100 simulations. We ran the same
experiments using the MATLAB code implementing the approach described in [Bashan
et al., 2008], which required 100.3 sec to compute adaptive sensor allocations, and 2.7 sec
of simulation time to evaluate performance (because their algorithm does not require in-
line replanning to ensure feasibility). The results are summarized in Table 5.1. The results
show that our algorithm achieves performance close to that of the optimal algorithm
described in [Bashan et al., 2008] with significantly lower computational requirements.
The above comparison does not highlight the fact that our algorithm is scalable
for problems with non-uniform prior information. In essence, for our algorithm non-
uniform priors would increase computation time by a factor of Q, the number of distinct
cells that require evaluation using our relaxation approach. In contrast, computation
times for the algorithm of [Bashan et al., 2008] would require enumeration of 1024^Q locations versus 1024 locations, an exponential increase in time, making consideration of non-uniform priors an infeasible problem.

Table 5.1: Performance comparison averaged over 100 Monte Carlo simulations. Relaxation is the algorithm proposed in this chapter, while Exact is the algorithm of [Bashan et al., 2008].

              Average Cost   Standard Deviation   Solution Time (s)   Simulation Time (s)
  Relaxation  0.2762         0.14                 9.5                 17.7
  Exact       0.2666         0.15                 100.3               2.7

Figure 5·14: Cost-to-go from π_k1 (J_1^λ(k) plotted versus π_1(k)).
As a final set of experiments, we implemented the algorithm of Section 5.3 using the
first simulation configuration with 100 cells and MD = 1. We used a grid of 1000 belief-
points to represent the cost-to-go function, with cubic spline interpolation for values in
between points. The optimal cost-to-go at stage 1 in the Bayes’ cost function is shown in
Fig. 5·14 and the corresponding optimal energy allocations are in Fig. 5·15. The results
show the expected cutoff where no action is taken once enough certainty exists in πk1.
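To make the belief-grid representation concrete, the following sketch (in C++ for illustration; the actual experiments used MATLAB with cubic-spline interpolation, and every name here is ours) tabulates a stage-1 cost-to-go over a uniform grid on the scalar belief and interpolates between grid points. Linear interpolation stands in for the splines to keep the sketch short.

// Illustrative only: tabulated cost-to-go J_1(pi) on a uniform belief grid.
// The dissertation's experiments used 1000 points with cubic splines; plain
// linear interpolation is shown here for brevity.
#include <algorithm>
#include <cstddef>
#include <vector>

struct CostToGoTable {
    std::vector<double> values;               // values[k] = J_1(k / (N - 1))

    double eval(double pi) const {            // interpolate between grid points
        const std::size_t N = values.size();
        if (N < 2) return values.empty() ? 0.0 : values.front();
        const double t = std::clamp(pi, 0.0, 1.0) * static_cast<double>(N - 1);
        const std::size_t k = static_cast<std::size_t>(t);
        if (k + 1 >= N) return values.back();
        const double frac = t - static_cast<double>(k);
        return (1.0 - frac) * values[k] + frac * values[k + 1];
    }
};

// Each entry values[k] can be filled independently (e.g. by minimizing the
// expected cost over the discretized stage-1 energy allocations), which is why
// the computation is trivially parallelizable across information states.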
Figure 5·15: Optimal stage 1 energy allocations x*_1(k) versus π_1(k).
Fig. 5·16 describes the initial resource allocations. These allocations are symmetrical
w.r.t. πk0; since most of the priors are small, the figure has more points for small values.
In terms of computation time, this algorithm required about 6 minutes of MATLAB
time to obtain the sensor plan, and around 6 seconds to run each simulation. However,
this number will not increase substantially for problems with more cells, because the
cost-to-go functions are reused across cells. Note also that the computations are trivially
parallelizable, as the cost-to-go can be computed in parallel for each information state.
Figure 5·16: Stage 0 energy allocation versus prior probability π(0).
In this chapter, we developed alternative algorithms that directly exploit Lagrangian
Relaxation techniques by exploring a constraint relaxation approach that replaces the
original resource constraints with an approximation. The Lagrangian Relaxation tech-
niques are faster and scale to more complex problems of the type addressed in search
theory, as shown in our analysis and simulation results. The constraint relaxation tech-
niques, coupled with Lagrangian Relaxation, enable hierarchical decomposition of coupled problems into independent subproblems per location, loosely coordinated by a
scalar value that prices resources. Recovering feasibility can be accomplished through
solution of on-line re-optimization problems with little loss in performance, as shown
in the experiments. Our results provide near-optimal algorithms that scale to larger
problems with inhomogeneous prior information, which makes them well-suited to RH
control approaches.
Chapter 6
Human-Robot Semi-Autonomous Systems
The final chapter in this dissertation is devoted to the subject of how to optimize in-
teractions between humans and robots working as a team. Some of the issues to be
discussed include: what tasks/roles are most appropriate for humans to do versus au-
tonomous agents, how machines can adapt to human operators and vice-versa, how a
semi-autonomous search and exploitation system can support time-varying levels of au-
tonomy that adapt to mission conditions, and the amount of information humans can
handle before succumbing to information overload.
After discussing these topics, we describe how models for human-machine interac-
tions can be empirically validated via the gaming medium. Hence, we design a strategy
game for the exploration of these issues. Our game design allows for various control
structures to be tested and for examining the performance of human supervisory control
when robots use the algorithms developed in the previous chapters for SM. This can
help determine summary statistics that robots should report to humans. The purpose
of this analysis is to determine which statistics/summary information can be used to
maximize human operator situational awareness.
6.1 Optimizing Human-Robot Team Performance
The forthcoming discussion summarizes various issues and trade-offs that are involved
in optimizing human and robot performance in a team environment.
6.1.1 Differences Between Human and Machine World Models
The goal of creating a semi-autonomous part-machine, part-human team with high per-
formance is non-trivial because humans do not share a common paradigm for reason-
ing that is recognizable to machines and vice-versa. Machine-reasoning systems make
decisions based on probabilities (belief-states), performance metrics, cost-sensitivity in-
formation, trade-offs between FAs and MDs, feature vectors generated from noisy sen-
sors and other quantitative metrics. Human decision-makers choose actions based on
intuitive threshold-levels, higher-level, amorphous contextual information, actual or per-
ceived patterns in the environment, arbitrary priorities and preferences and from ten-
dencies formed by habit. Humans tend to have a wealth of experience at their disposal
that is generally valuable and very difficult to parameterize for machine use.
Some means of mapping back and forth between human and machine-reasoning
frameworks for decision-making is necessary to allow machines to collaborate effectively
with human supervisors/team-members. While mathematical models are useful for char-
acterizing the most effective behaviors (actions) that robots can perform in various sit-
uations, it is difficult to quantify human actions in mathematical form. For this reason,
this chapter proposes empirical methods to evaluate the best means of characterizing
robot information-states for human consumption and for analyzing the sensitivity of
human performance to various types of information reported by robots.
6.1.2 Human Decision-Making Response Time
A trade-off is necessary between the amount of time allotted to human operators/team-
members for higher-level processing and the utility that such “soft-inputs” have in a
real-time decision-making system that operates in an uncertain, dynamic environment.
This is another instance of the search versus exploitation paradigm that applies to
human reasoning processes. If humans take too long to reach a good decision, the world
around them will have evolved into a new state that is not necessarily related to the
one for which a decision is to be made. If humans are forced to act too quickly, they
may not be able to outperform machine systems, and therefore lose their utility in a
semi-autonomous system. In order to close the loop and allow robots to collaborate
with humans, algorithms used for machine-reasoning need to incorporate an awareness
of the time-scale that humans require for choosing courses of action.
Tasking humans with decision-making at a time-scale that is at or near the limits
of their capability only detracts from human situational awareness and leaves humans
prone to making poor decisions. Time-scales appropriate for human interaction need to
be established as a function of current and projected near-term environmental complex-
ity and need to include the fixed-cost of asking a human operator to switch contexts as
well as the time-cost of analyzing a situation to reach a decision. The relative benefit
of human supervision must be weighed against how well a machine can perform inde-
pendently and an awareness of what else the human could have been doing with that
time. This problem is similar to a job-scheduling problem with stochastic setup and job-
completion times where the “servers” are human operators. (See [Leung, 2004,Sellers,
1996] for an overview of job-shop scheduling problems).
6.1.3 Human and Machine Strengths and Weaknesses
Whereas humans are capable of significant parallel processing (e.g. w.r.t. being aware
of and responding to numerous novel stimuli at the same time), they do so on a very
slow time-scale relative to autonomous agents. However, automata are typically built
to perform just a few select tasks very well and at high-speed; developing algorithms
that have generalization capability remains a serious challenge. Humans can currently
identify patterns and trends in complex environments that far exceed the ability of any
automaton to process. On the other hand, humans are fallible, can become bored or
confused and can act with predilection or supposition or short-sightedness. The larger
the task that a human is given, the more contextual information there is to be aware of,
the longer it takes human operators to switch between tasks [Cummings and Mitchell,
2008]. Task size is therefore a design parameter, and tasks need to be divided up into appropriately sized pieces to maximize human operator effectiveness.
One of the largest problems with incorporating input from human operators that
oversee multiple robotic vehicles is that it is easy to confuse contexts in switching be-
tween tasks and in so doing, to lose situational awareness [Cummings et al., 2005]. If
human agents become confused, they are rendered idle until re-situated. The higher
the complexity of the mission-space and the more dynamic the state-space, the more
likely such loss of awareness is to happen. However, this issue can be quantified and
anticipated, which allows for potential problems to be in large part averted. One
means of staving off these problems is by providing human operators with key fea-
tures/indicators when switching tasks such that information is presented to humans on
a need-to-know/prioritized basis (with a certain amount of redundancy).
There is a duality relationship between man and machine that can be exploited.
Humans can help automata perform better by adding domain-specific knowledge that
is not yet modeled (or feasible to model) in machine planning algorithms. Human par-
ticipants in the decision-making loop can employ higher level reasoning processes (e.g.
handling of contingency scenarios when a vehicle falls out of commission) to add robust-
ness and human experience to the search and exploitation process. A semi-autonomous
hybrid controller can take the input from both human and non-human controllers and
fuse it together in such a way as to leverage the strongest characteristics of each of the
component controllers.
6.1.4 Time-Varying Machine Autonomy
It is of practical interest to try and develop semi-autonomous object search and exploita-
tion algorithms that can operate at dynamic levels of autonomy according to mission
complexity [Bruemmer and Walton, 2003, Baker and Yanco, 2004, Schermerhorn and
Scheutz, 2009,Goodrich et al., 2001]. The goal is to keep human workload at an accept-
able level at the busiest of times while not forgoing the performance enhancements that
humans can provide a semi-autonomous control system at less busy times. In routine
conditions, a semi-autonomous system may be able to proceed on “autopilot” whereas in
dangerous situations, or ones in which there is a lot of uncertainty/rapidly changing en-
vironmental conditions, it may be advantageous for humans to step in and assume more
authority over UAV activities. It is obviously very important that any robot operating
in the field be able to quantify its degree of uncertainty about the external environment
and “understand” when its model for the world isn’t matching reality.
6.1.5 Machine Awareness of Human Inputs
The amount of input that a human operator can give to numerous vehicles communicat-
ing on a one-on-one basis is limited, so the duration and objective of these interactions
needs to be delimited by a protocol that is adaptive to circumstances [Cummings et al.,
2005]. Allowing humans to participate in search and exploitation missions requires de-
veloping sophisticated Graphical User Interfaces (GUI’s) that present a dynamic flow of
information that is filtered/emphasized in real-time according to the content of the data
being displayed. In order to best leverage humans’ cognitive abilities for prediction, in-
tuition and pattern-recognition, it is necessary to design software that takes account of
the relative importance of the information being displayed and how long a human will re-
quire in order to grasp and make use of that information [Kaupp and Makarenko, 2008].
If human operators do not have sufficient time to consider and react to information that
is displayed to them, that information becomes noise that impairs the completion of
other (feasible) tasks, and it should not be displayed [Cummings et al., 2005]. There
is also a potential schism that must be handled concerning the activities a human may
wish to focus on versus what a predictive decision-making algorithm considers to be the
most important issue for human operators to work on.
6.2 Control Structures for HRI
There are several degrees of freedom to be explored when it comes to the control struc-
tures that are used to regulate, route, sequence and validate decisions made by machines.
The number of robotic agents that a single human can oversee while acting as a supervi-
sor is a central issue to explore [Cummings and Mitchell, 2008,Steinfeld et al., 2006]. In
addition to using humans in a supervisory capacity, human operators can act as peers
within a semi-autonomous team. Humans can also assume control of robots in time-
varying fashion as a situation warrants. For instance, 3 humans could oversee 9 robots (in 3 subgroups), and a 4th human could be tasked to assist one of the 3 subgroups
in time-varying fashion as needed. Up to a certain level of dynamism in the environ-
ment, this would allow all 3 subgroups to reap most of the benefits of having 2 human
decision-makers involved at all times [Nehme and Cummings, 2007].
In addition to exploring the best choice for optimal team size and composition, the
tasks that human/non-human operators perform best can vary over time. Tasks can
be partitioned according to several different static and dynamic strategies [Gerkey and
Matarić, 2004]. As a first pass, a catalog of activities can be created, and then hu-
mans can be given a fixed subset of the activities that they perform well and robots the
complementary subset. Tasks can be partitioned between humans and machines based
upon geographical constraints, which makes sense if humans are navigating vehicles and
doing their own information-gathering. Tasks can preferentially be given to a human
(machine) decision-maker and robots (humans) can be used as a fall-back option. Dif-
ferent humans have heterogeneous capabilities and robots may as well. Task-allocations
should take such heterogeneity into account [Nehme and Cummings, 2007]. Tasks can
be partitioned based on a time-window threshold at which point humans will not have
enough time to accomplish the task, so therefore a robot must take responsibility. Tasks
can be allocated based on situational awareness metrics that machines use to gauge hu-
man preparedness [Cummings et al., 2005]. Decisions must be made in a collaborative
fashion, however, such that human and non-human decisions complement each other
versus impede each other [Dudenhoeffer, 2001]. A mixed initiative control mechanism is
necessary in order to incorporate both robot and human decisions into a single, unified
plan of action. Human operators may be able to quickly generate a few candidate plans
of action without being able to determine which one is best. If people can help “narrow
the playing field” for autonomous agents, the agents can concentrate their processing
power on looking more deeply into such a refined/limited set of strategies.
As much as possible, it is desirable to use a policy-driven decision-making scheme
wherein a human makes a standing decision that remains in effect unless preempted by
another human or a more important decision. “Management by exception” is required
in order to support a system with one human controlling large numbers of robots [Cum-
mings and Morales, 2005]. Threshold policies and quotas concerning resource usage and
rate of usage are types of policies that both humans and non-humans can use to good
effect. We explore how these decision-making techniques can be exploited to relieve
the decision-making burden on human-operators but at the same time keep them as
informed as possible about the evolving state of a mission.
6.2.1 Verification of Machine Decisions
Autonomous agents have yet to become fully accepted as team-members alongside hu-
mans in a collaborative environment. This issue has been observed in search and rescue
settings as well as military ones. It takes time for people to gain confidence in new
technology [Cummings et al., 2005]. In order to facilitate the acceptance process of
autonomous agents, it is necessary to build machine-reasoning algorithms that not only
arrive at verifiably good decisions, but ones that are transparent to a human opera-
tor. The issue of not over-trusting or under-trusting automata is a key factor in system
performance [Freedy et al., 2007].
In short, we seek to consider the performance benefits of including one or more hu-
mans in the SM decision-making loop to guide the actions of semi-autonomous vehicles.
We seek to identify tasks that robots do not perform well, that are computationally
intractable for automata, or where machines would benefit from an external informa-
tion source. Incorporating human input increases robustness to model error, allows the
system to be capable of handling anomalies that are outside the scope of its design, and
adds a level of redundancy that ensures the robots remain grounded and on track at
all times. Conversely, having computerized feedback from robots can help the humans
remain situated at all times as well.
6.3 Strategy Game Design
Having discussed some of the issues concerned with human supervisory control of teams
of robots, this section describes how human operators may enhance the performance of
our search and exploitation system. In preceding chapters we have developed algorithms
that allow robots operating in dynamic environments of limited size to conduct near-
optimal search and exploitation activities. In order to create an algorithm for SM over
extended distances, we use a game that explores using a human supervisor to coordinate
the activities of multiple teams of robots operating in distinct areas.
Our game is a real-time strategy game with one human player and multiple UAVs.
The objective of the game is for the human to partition tasks between several teams
that are each composed of several UAVs. The game has a clock that counts down as
time elapses. We divide the mission space into two or more zones of operation where
the operator can task UAVs to (autonomously or semi-autonomously) search and exploit
objects. In the autonomous mode of operation the robots are fully responsible for how
sensing resources are expended in the search and exploitation process. In the semi-
autonomous mode of operation, UAVs use guidance from the human operator before
spending sensing resources (“management by consent”).
If the human operator tells a UAV to move between zones of operation, it takes the
UAV Td seconds to move to the new region, and during this time, the UAV is unable
to perform sensing tasks. Td is a design parameter for our game whose value can be
set to a large number to raise the level of commitment required to assign a UAV to a
new region or lowered in order to decrease the significance of moving a UAV to a new
region. While UAVs are in each zone of operation, they choose sensing tasks and expend
their sensing resources semi-/autonomously in executing sensing operations to the best
of their ability. The human operator oversees the consumption of resources by each UAV
and how the total expected classification cost of the objects in each zone changes over
time. This information can be used by the human operator to determine if it is time to
direct a UAV to move to a new region of operation.
The Graphical User Interface (GUI) for our game design is shown in Fig. 6·1. In
this figure, an instance of the game is portrayed in which there are two regions for UAV
sensing operations for the sake of simplicity. Each region consists of a grid of cells that
are color-coded according to the current belief-state for each cell. Since there are 3 primary colors, this game allows up to 3 types of objects to be represented.
The game is shown in a state where one UAV is searching the left region, one UAV
is sensing the right region and one UAV is moving from the right to the left region.
The sensing resources held by each UAV are shown as a horizontal bar below each
UAV’s icon. The hourglass symbol beside the UAV that is placed between the two grids
indicates that this UAV is in transit and the operator must wait for that UAV to arrive
in the region on the left. Cells with ambiguous information states are shown with gray
or “muddy” colors, and cells that are well-classified have colors close to pure red, green
or blue.
The game displays several summary statistics below each grid/region/zone. The “Cost” value and horizontal bar show the current expected classification
cost for the objects in the grid for each region. The “Bound” value and horizontal bar
displays the lower bound on the classification cost for the objects in a grid that is re-
turned from the column generation routine. The “Res” line displays the total sensing
resources that are available across all of the UAVs in each grid. The last item, “Delta”,
displays a new lower bound value that would result from giving the associated region
another allotment of ∆R resources. Moving UAVs between regions allows sensing re-
sources to be moved between each of the sensing zones in order to do load-balancing of
sensing resources.
We have designed this game with a server-client interface in mind. The client pro-
gram has in-game buttons to play/pause, stop and restart the simulation. Assuming
there are two sensing zones in the game, UAVs can be moved between zones by simply
clicking on their icons (assuming they are not already switching between regions). If
the game has more than two zones of operation, UAV icons can be dragged over a new
region to tell them to move there. The game uses a login screen to allow a player to log
on to the server that runs the back-end code (the sensor management system) before the
game begins; the clients are essentially thin clients.
Figure 6·1: Graphical User Interface (GUI) concept for semi-autonomous search and exploitation strategy game.
The server supports one client at a time, and communication is based on a TCP/IP protocol. This design allows multiple
clients (multiple human players) to be involved in future game designs. After a game is
finished, the server can collect statistics on the behavior of each player and the client
program keeps track of this information while a game is in progress. The server uses
(continues to use) C++ and C code, whereas the client can be implemented using Qt,
Java, DirectX or even in MATLAB. The server has one thread for each client for com-
munication purposes and a dedicated thread for running computations (to create sensor
plans via column generation). The clients have one thread for communicating with the
server and one thread for handling user input (so that the process of communicating
with the server is independent of the process of communicating with the player).
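As a rough sketch of this threading layout (illustrative only; the names are our placeholders, there is no real TCP/IP handling, and the loops are stubs rather than the actual server code):

// Sketch: one communication thread per connected client plus a dedicated
// planning thread, mirroring the layout described above.
#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> running{true};

void clientCommLoop(int clientId) {
    while (running) { /* receive player commands, send game state over TCP */ }
}

void plannerLoop() {
    while (running) { /* recompute sensor plans via column generation */ }
}

int main() {
    std::vector<std::thread> workers;
    workers.emplace_back(plannerLoop);        // dedicated computation thread
    workers.emplace_back(clientCommLoop, 0);  // one comm thread per client
    running = false;                          // stub: shut down immediately
    for (auto& t : workers) t.join();
}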
This game design can be used to test our hypotheses concerning human performance.
There are multiple dimensions to be explored:
• statistics for situational awareness
• statistics for human decision-making response time
• statistics per operator for inclination to search versus exploit
• statistics per operator of how performance improves with gaming experience
• best ratio of UAVs per operator
• best ratio of the number of sensing zones per operator
• value of numerical versus graphical versus auditory feedback to guide operator
• best arrangement of GUI elements to maximize software intuitiveness
• relative value of policy-based heuristics (fully autonomous) versus human-guided
decision-making for moving platforms between regions
• operator overload as a function of:
– environmental complexity and dynamics
– simulation rate
– number of sensing regions/zones
– number of UAVs in the simulation
• human performance as a function of detail in simulation cost information:
– expected cost per region, resources per region only
– all of the above and projected cost per region using lower bound solution
– all of the above and projected cost sensitivity to additional resources
Drawing from all of these issues, we make the following hypotheses:
• human situational awareness is a function of:
1. number of robots per team
2. number of teams
3. number of zones for sensing operations
4. number of locations per zone
5. granularity of resource allocations
6. simulation rate
7. rate of environmental dynamics
• an optimum exists for the best number of robots per team and number of teams
per human operator
• human performance will degrade linearly with increasing simulation rate up until
a certain threshold and non-linearly thereafter
• situational awareness can be improved by using per-zone summary statistics that
describe the time-varying performance of each robotic team using visual and au-
ditory clues, and operator overload can be mitigated by simultaneously using au-
ditory and visual channels for delivering status information
• projections formed from our lower bound computations will increase the perfor-
mance of human planners in a statistically significant way
• management by consent policies for sensor resource allocation will have the best
performance up to a certain level of environmental complexity and simulation rate
at which point management by exception will be the better strategy
• human operators will not be able to effectively use per location probabilistic infor-
mation whether it is represented as colors or in some other form unless the game
is trivially simple
• humans will require playing the game 10 or more times to become proficient
• operator boredom will contribute to UAVs being assigned to new regions more
often than they should be
Our hypotheses concerning situational awareness can be tested using a fractional
factorial design of experiments set of simulations. The control experiment for how well
a human can interact with a team of robots can be handled by taking statistics on
the performance of a human operator working with a single robot in a single sensing
region. We can quantify how well the autonomous sensor planning algorithms perform by
comparing their performance within a single region of operation with that of a human
who is tasked to manually plan sensing operations for the same experimental setup.
In both cases Monte Carlo runs with various operators can be used to average out
experimental uncertainty concerning variability in human performance.
After collecting data across a number of simulation runs with multiple players, we
can statistically quantify the significance of each of these hypothetical performance fac-
tors. We can empirically determine the number of robots per human that maximizes
performance at the same time. We can study the relative value of summary statistics
for the state of the whole game and for each zone in the game by running simulations
in which different operators have different statistics exposed to them and by watching
how the performance of these operators differs. It is relatively straightforward to test
the utility of providing predicted game/simulation cost information to human players.
One question is just how useful this information will be, and a second question is how
this information can best be presented to an operator. A population of human operators
can play the game with and without cost predictions per zone and with this information
displayed in various ways to determine what the utility of the information is.
Our intuition is that game players will perform better having cost-sensitivity infor-
mation available while making resource allocation decisions. We think that playing the
game using the lower bound on classification cost will also be an advantage to players
versus playing the game with current expected classification cost statistics alone. We
believe that per UAV resource information for a single-dimensional resource pool will
be useful for operators, but that information on multi-dimensional resource pools will
overload operators.
This game design can be used as a test-bed for future work in the domain of semi-
autonomous human and machine teams. In this chapter we have highlighted some of
the key issues that are involved in the design of an effective hybrid control system and
prescribed this computer game as a means of exploring the various trade-offs involved.
In the first iteration, we envision a single-player game, but in successive versions, we
anticipate no difficulty in incorporating input from multiple human players using the
server-client paradigm. With the gaming medium, it should be possible to explore all
of the issues concerning how humans interact with humans and humans interact with
machines in such a hybrid, multi-player environment. After a multi-player version of this
game has been implemented, it can be used to model realistic search and exploitation
scenarios similar to those found in the field.
Chapter 7
Conclusion
Viewed from a high level, this dissertation seeks to address outstanding problems in the
domain of optimal search theory with follow-up actions, the trade-off between search
versus exploitation and human-computer relations/human-factors. Within this context,
we have presented algorithms that allow large, combinatorially complex problems to
be broken up into subproblems of a tractable size that can be solved in real-time. We
perform these hierarchical decompositions using Lagrangian Relaxation and Column
Generation to coordinate the solutions of independently-solved subproblems without
losing the fidelity represented in subproblem solutions.
7.1 Summary of Contributions
In one of the most important contributions of this dissertation, Ch. 3 describes novel
techniques for RH control algorithms based on mixed strategies and a lower bound for
sensing performance that was developed in [Castañón, 2005a, Castañón, 2005b]. These
strategies consider near-optimal (non-myopic, adaptive) allocation schemes for a set of
noisy, multi-modal, heterogeneous sensors to detect and classify objects in an unknown
environment in the face of resource constraints using a centralized control algorithm.
We consider mixed strategies for sensing that employ a handful of possible sensor modes
and a discrete set of measurement symbols with deep (far-seeing) decision-trees.
A C++/C-language simulator was constructed to implement our RH control algo-
rithms, and simulations using fractional factorial design of experiments were performed.
Differences in sensor geographical placement, sensor capabilities, sensor resource lev-
els, planning horizon, and the relative cost of FAs and MDs were considered. We have
demonstrated that at least in the simulation scenarios considered, the use of a pure strat-
egy for RH control that minimizes expected resource usage (subject to the constraint
that sensing activities are performed) has near-optimal performance.
We describe the extension of search functionality to an algorithm that was previously
used solely for object classification (Section 2.2.2). The Search versus Exploitation trade-
off is an important aspect of the SM problem that we address near-optimally for the SM
problem formulation considered.
Another contribution of this dissertation is presented in Ch. 4 in which two possible
extensions to the problem formulation of Ch. 3 are developed. The first such extension
describes the theoretical basis whereby SM can be conducted in a dynamic environment
made up of a set of N locations with independent but time-varying states. These state
dynamics are represented with HMMs, and a lemma is provided to show that even
without a time-invariant state, it is possible to decouple subproblems by making use
of time-varying Lagrange multipliers and an expanded Column Generation algorithm.
In an alternative extension, we describe how problems with known but time-varying
visibility can be modeled as well, by solving the resource allocation problem in terms of
strategies that mix between resource use per sensor per time. Locations with known,
time-varying visibility are germane to such applications as remote-sensing with satellites
following predictable trajectories.
In Ch. 5, we consider an alternative formulation for SM in a detection context that
uses a continuous action space and observation space (Gaussian Mixture Model) with a
two-stage sensing horizon. DP and Finite-Grid techniques are used to optimally solve
sensing subproblems and a Line-Search is used to find the optimal price of sensor re-
sources. Using these techniques for problem decomposition, we near-optimally solve a
more general version of the problem that was posed by [Bashan et al., 2007] in roughly
two orders of magnitude less computing time. First and foremost we avoid performing
N dimensional grid-searches while looking for the optimal per-location sensing energy
allocations. Lagrangian Relaxation and Duality theory are harnessed for this purpose.
We make the argument that these decomposition methods, as proposed by [Yost and Washburn, 2000, Castañón, 2005a, Castañón, 2005b], can be applied to a wide range of
problems involving search, classification, (sensor) scheduling and assignment problems
of which Ch. 3 and Ch. 5 provide prototypical examples.
The final contribution of this dissertation consists of the design of a game that ex-
plores issues surrounding the best means of allowing humans (operators) to input feed-
back into a semi-autonomous system that performs search+exploitation functions with
the ultimate goal of developing a near-optimal, mixed initiative, human+robot search
team that leverages the strengths of machine algorithms (model predictive control with
scripted intelligence) and human intelligence (real-time feedback and adaptation). We
propose the design of this game as a means of empirically measuring the most informa-
tive type of GUI interface for human operators that maximizes situational awareness
and minimizes operator workload. Such a design allows more robots to be controlled
per human operator.
7.2 Directions for Future Research
There are numerous directions for future work in the domain of SM. First of all, the
algorithms we have proposed for RH control using time-varying Lagrange multipliers
could be implemented. Column generation is known to have slow convergence properties
for large problem instances, so the question of how these algorithms scale in the context
of strategies that randomize sensor utilization over sensors and time (i.e. many more
multipliers) is of interest.
The implementation of a game such as the one we have designed to explore the
best framework for human supervisory control of robots is another important direction
of inquiry. The ultimate goal being the creation of a mixed-initiative system wherein
humans are maximally aware of what the automata are doing, robots are well-situated
w.r.t. the situational awareness of their human operators, multiple human operators
are able to communicate essential information between themselves, and all parties are
continually tasked with activities that they are well-suited to perform. Robots that
support adaptive autonomy levels are a topic of much interest recently. Ideally automata
will be self-sufficient when humans are already over-loaded with other tasks but still
designed to be able to incorporate more fine-grained human input when it is available.
Robots that are managed by exception and that do not need explicit instructions for
each and every task they perform are a long term goal in the Human-Robot Interaction
domain and in the domain of human-assisted search and exploitation.
A system for search and exploitation that jointly performs near-optimal SM and
path-planning would be a direct though non-trivial extension of this research. We are
interested in a problem paradigm that does not attribute value to moving a sensor to a lo-
cation, but moving a sensor to be within sight of a location. After developing a tractable
algorithm for SM with path-planning, a paradigm with risk of platform loss/malfunction
can be considered. A near-optimal and adaptive algorithm for decentralized SM would
be a significant contribution as well. Tractable algorithms for near-optimal SM with
moving targets are interesting and difficult problems to work on.
Higher resolution sensor models (more subproblem resolution [Jenkins, 2010]) and
support for correlated sensor observations could be investigated. The models we have
discussed in this work are appropriate for sensors that make observations with a narrow
FOV (e.g. an electro-optical camera with a telephoto lens). Alternatively, research could
be conducted for problems where sensors make observations over extended areas at the
same time, which introduces a data-association problem.
Our algorithms have assumed there is no correlation between the states of various
locations. This assumption was a requirement in order to make decomposition tech-
niques possible. Additional research work is needed to create tractable algorithms that
support correlation of object states across locations.
PBVI techniques lend themselves to solution via parallelization methods. Specialized computing
hardware such as NVIDIA Tesla GPUs can be leveraged for this purpose to create real-
time SM algorithms for problems of realistic size. Also, reinforcement learning and
neurodynamic programming methods can be used to generate off-line value function
approximations. In future work, approximate and general value functions could be
computed off-line, stored and then used to seed solutions of online algorithms.
Appendix A
Background Theory
In this appendix we briefly overview some of the theory and concepts used in this
dissertation. We first summarize the definition of a POMDP model and describe how
the Witness Algorithm or PBVI can be used to solve POMDPs. Next, we discuss
Dantzig-Wolfe Decompositions and Column Generation as a special case.
A.1 Partially Observable Markov Decision Processes
A Markov Decision Process (MDP) is a dynamic decision problem where the underlying
state evolution is modeled by a Markov process, controlled by the decisions, and the state
is perfectly observed and used as the basis for making adaptive decisions. There are a
wide variety of uses for MDPs, some of which are discussed in [Bertsekas, 2007]. Two
dissertations focusing on MDPs and their applications are [Patrascu, 2004, McMahan,
2006]. DP can be used to find optimal adaptive strategies to control a system whose
state evolves in discrete time according to MDPs, assuming we know the parameters of
the MDP model.
An MDP model, despite its usefulness, is not sufficiently powerful in its descriptive
ability to represent our SM problems. In our problems, we do not have access to the full
state information but only to noisy measurements of the state. As a generalization to
an MDP, a Partially Observable Markov Decision Process (POMDP) [Monahan, 1982]
is an MDP in which only probabilistic information is available regarding the state of
the world. This probabilistic information is summarized into a sufficient statistic called
a “belief” or “information-state”. In a POMDP model, when an agent takes an action
it also receives an observation that is probabilistically related to the true state of the
world, and these observations can be used along with Bayesian inferencing to learn
about the system under observation. The underlying information state of a POMDP at
a particular stage is a belief-state, corresponding to the conditional probability of the
world state (aka core state/underlying state) given all past observations.
Formally, a POMDP is composed of the n-tuple (Xt, Ut, Ot, π1) along with the
functions T : Xt × Ut → Xt+1, Yt : Xt × Ut → Ot and Rt : Xt × Ut → ℜ where these sets
and functions are defined as follows:
• Xt the set of possible discrete states at stage t
• Ut the set of possible sensor actions at stage t (finite-dimensional)
• Ot the set of possible observations at stage t (finite-dimensional)
• π1 the initial belief-state
• T the state transition function (Markov) with:

  T(x_t, u_t) \equiv \pi_{t+1} = \frac{\mathrm{diag}\{P(o_t \mid x_t = k, u_t)\}\, \pi_t}{\mathbf{1}^T \mathrm{diag}\{P(o_t \mid x_t = k, u_t)\}\, \pi_t}
• Yt = P(ot|xt, ut), the observation function that relates the sensor action to the environ-
ment being observed
• Rt = rt(xt, ut), the cost/reward function which gives the immediate cost/reward of a
sensor action from a particular state
for a T-stage problem where t ∈ {1, . . . , T}. In general, the objective of a POMDP
problem is to select a policy γ that minimizes/maximizes:
E^\gamma \left[ R_T(x_T, u_T) + \sum_{t=1}^{T-1} R_t(x_t, u_t) \right]
where the policy γt : Xt → Ut and γ = {γ1, . . . , γT }. Using Bellman’s Principle of
Optimality, a DP cost-to-go function can be written as:

V^*(\pi_t, t) = \min_{u_t \in U_t} \left[ \langle R_t(u_t), \pi_t \rangle + \sum_{o_t \in O_t} V^*\big(T(\pi_t, u_t, o_t),\, t + 1\big)\, P(o_t \mid I_t, u_t) \right]
where the quantities It represent the information history (the set of previous actions and
observations) up until stage t and πt is the belief-state at time t (a sufficient statistic for
It). The inner product < Rt(ut), πt > represents the expected immediate cost/reward
for being in belief-state πt and selecting action ut. P(ot|It, ut) is given by:
P(o_t \mid I_t, u_t) \equiv P(o_t \mid \pi_t, u_t) = \sum_{x' \in X_t} Y(o_t \mid x', u_t)\, \pi_t(x')
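As a small illustration (our code and naming, not taken from the cited references), for a static core state the belief update T and the observation likelihood above reduce to an elementwise product followed by normalization:

// Sketch: Bayesian belief update for a discrete-state POMDP.  Y_o[x] is
// assumed to hold Y(o | x, u) for the received observation o and chosen
// action u; pi[x] is the current belief-state; the core state is static here.
#include <cstddef>
#include <numeric>
#include <vector>

// P(o_t | pi_t, u_t) = sum_{x'} Y(o_t | x', u_t) * pi_t(x')
double observationLikelihood(const std::vector<double>& Y_o,
                             const std::vector<double>& pi) {
    return std::inner_product(Y_o.begin(), Y_o.end(), pi.begin(), 0.0);
}

// pi_{t+1}(x) = Y(o | x, u) * pi_t(x) / P(o | pi_t, u)
std::vector<double> beliefUpdate(const std::vector<double>& pi,
                                 const std::vector<double>& Y_o) {
    std::vector<double> next(pi.size());
    const double norm = observationLikelihood(Y_o, pi);
    for (std::size_t x = 0; x < pi.size(); ++x)
        next[x] = Y_o[x] * pi[x] / norm;
    return next;
}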
A solution to a POMDP problem has two components. First of all, a value function
is constructed which gives the optimal reward/cost as a function of the belief-state.
This value function can be used to compute a policy (aka decision-tree) that gives the
optimal course of action for the associated belief-state at that stage. For finite-horizon
problems, there is a (generally distinct) value function associated with every decision-
making stage and the optimal policy is time-varying. For infinite horizon problems,
there is just one value function and the policy is stationary (after convergence to the
optimal policy). This makes it much more complicated to solve finite-horizon POMDPs
versus infinite horizon POMDPs. See Fig. A·1 for an example of an optimal value
function. In this example there are two possible states for each location: X = {empty,
occupied}. The hypothesis H2 corresponds to the decision that an object is present at
location i and H1 indicates location i is empty: Pr(xi) = Pr(H2) = Pr(occupied) and
Pr(H1) = 1.0 − Pr(H2). There is one sensor with the generic mode “Measure”. The
nodes in this figure have a one-to-one relationship with the nodes in Fig. A·2. In this
example, the cost of the measurement action was 0.2 units and the cost of a classification
error is 1 unit. The optimal value function as given by the DP Value Iteration method
119
Figure A·1: Hyperplanes representing the optimal Value Function
(cost framework) for the canonical Wald Problem [Wald, 1945] with hori-
zon 3 (2 sensing opportunities and a declaration) for the equal missed
detection and false alarm cost case: FA=MD.
(or approximately given by another method) is the concave hull of these hyperplanes.
The classification costs (dependent on P(H2)) give the hyperplanes their slope. The
measurement cost raises the level of the hyperplanes but does not change their slope.
The optimal value function can be written using hyperplanes called α-vectors as a
basis [Smallwood and Sondik, 1973].

Figure A·2: Decision-tree for the Wald Problem. This figure goes with Fig. A·1.

The concave (convex) hull of the vectors for a cost (reward) function gives the optimal cost-to-go value for a particular belief-state (probability vector) π:

V_t(\pi) = \min_{\alpha \in V_t} \sum_{x \in X} \alpha(x)\, \pi(x) \qquad (A.1)
Using this set of hyperplanes, the value function backup operation V = HV' can be performed in four steps as follows:

1. First the intermediate sets Γ^{u,*} and Γ^{u,o} are required ∀ u ∈ U and ∀ o ∈ O:

   \Gamma^{u,*} \leftarrow \alpha^{u,*}(x) = R(x, u) \qquad (A.2)

   where R(x, u) is the reward (or cost) received for executing action u in state x.

2.
   \Gamma^{u,o} \leftarrow \alpha_i^{u,o}(x) = \beta \sum_{x' \in X} T(x, u, x')\, Y(o, x', u)\, \alpha'_i(x'), \quad \forall\, \alpha'_i \in V' \qquad (A.3)

   where T(x, u, x') is the transition probability function from state x to state x' and Y(o, x', u) is the likelihood of observation o given state x' and action u. The variable β is a discount factor for infinite-horizon DP; for finite horizons it is set to 1.0.
3. The next step is to create the sets Γ^u ∀ u ∈ U. This represents the cross-sum over the observations and includes one alpha-vector α^{u,o} from each Γ^{u,o_z} for z ∈ {1, . . . , |O|}:

   \Gamma^u = \Gamma^{u,*} \oplus \Gamma^{u,o_1} \oplus \Gamma^{u,o_2} \oplus \dots \oplus \Gamma^{u,o_L} \qquad (A.4)

   where L = |O| and the symbol ⊕ represents a cross-sum operator.

4. The last step is to take the union of the sets Γ^u, which are known as "Q-factors":

   V = \bigcup_{u \in U} \Gamma^u \qquad (A.5)

The value Γ^u represents the optimal cost-to-go provided that the first action is action u,
i.e., it is a branch of the optimal decision-tree with one less stage to go. See [Pineau
et al., 2003] or [Kaelbling et al., 1998] for more details about this formulation.
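A compact sketch of these four steps follows (our illustrative code, not the dissertation's simulator; the model layout is assumed, and the pruning of dominated vectors that any practical solver requires is omitted for brevity):

// Sketch of one exact DP backup over alpha-vectors, following Eq. A.2-A.5.
#include <cstddef>
#include <vector>

using Alpha = std::vector<double>;            // one hyperplane over |X| states
using AlphaSet = std::vector<Alpha>;

struct Pomdp {
    std::size_t nX, nU, nO;
    std::vector<std::vector<double>> R;                   // R[u][x]
    std::vector<std::vector<std::vector<double>>> T, Y;   // T[u][x][x'], Y[u][o][x']
};

AlphaSet exactBackup(const Pomdp& m, const AlphaSet& Vprev, double beta = 1.0) {
    AlphaSet V;
    for (std::size_t u = 0; u < m.nU; ++u) {
        // Step 1: Gamma^{u,*}(x) = R(x, u)
        Alpha gammaStar = m.R[u];

        // Step 2: Gamma^{u,o}_i(x) = beta * sum_{x'} T(x,u,x') Y(o,x',u) a'_i(x')
        std::vector<AlphaSet> gammaUO(m.nO);
        for (std::size_t o = 0; o < m.nO; ++o)
            for (const Alpha& ap : Vprev) {
                Alpha a(m.nX, 0.0);
                for (std::size_t x = 0; x < m.nX; ++x)
                    for (std::size_t xp = 0; xp < m.nX; ++xp)
                        a[x] += beta * m.T[u][x][xp] * m.Y[u][o][xp] * ap[xp];
                gammaUO[o].push_back(a);
            }

        // Step 3: cross-sum Gamma^u = Gamma^{u,*} (+) Gamma^{u,o_1} (+) ...
        AlphaSet gammaU{gammaStar};
        for (std::size_t o = 0; o < m.nO; ++o) {
            if (gammaUO[o].empty()) continue;             // e.g. Vprev empty
            AlphaSet next;
            for (const Alpha& base : gammaU)
                for (const Alpha& add : gammaUO[o]) {
                    Alpha s = base;
                    for (std::size_t x = 0; x < m.nX; ++x) s[x] += add[x];
                    next.push_back(s);
                }
            gammaU.swap(next);
        }

        // Step 4: V = union over actions of Gamma^u (pruning would go here)
        V.insert(V.end(), gammaU.begin(), gammaU.end());
    }
    return V;
}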
If there are |V'| α-vectors in the previous basis for the optimal value function, the first step generates |U||O||V'| projections. The second step then generates |U||V'|^{|O|} cross-sums. Therefore, although in practice many of the vectors that are generated are dominated (and therefore pruned out of the solution set), it is theoretically possible to have |U||V'|^{|O|} vectors in the value function for V, with order |X|^2 |U||V'|^{|O|} time-complexity. There is an exponential growth within a single backup relative to the number of hyperplanes supporting the concave (or convex) hull in the previous stage, and every one of these new hyperplanes will become part of the problem (burden) just one stage later.
Fig. A·3 and Fig. A·4 demonstrate how the complexity of the structure of the convex
(concave) hull of the set of hyperplanes representing the optimal value function grows
with increasing dimension of the state, and this is a relatively simple example with
just 4 possible states. In general after forming the projections, the hyperplanes that are
dominated (out-performed) by other hyperplanes must be pruned out of the solution set,
and testing every hyperplane against every other hyperplane (for instance, by solving
an LP) is a time-consuming operation. Solving a single LP is of polynomial complexity,
but solving an exponentially growing number of them is exponentially complex.
Monahan provides a survey of POMDP applications and various solution tech-
niques [Monahan, 1982]. Sondik’s One-Pass Algorithm [Smallwood and Sondik, 1973]
was the first exact algorithm proposed for solving POMDPs. Michael Littman’s Wit-
ness Algorithm [Littman, 1994] is a more recent POMDP algorithm that has computa-
tional benefits over Sondik’s One-Pass Algorithm. The Witness Algorithm maintains a
set of hyperplanes to represent the optimal value function in a POMDP and then sys-
tematically generates and prunes the possible next-stage hyperplanes in performing the
DP backup (backwards recursion) operation until an optimal value function is found.
Figure A·3: Example of 3D hyperplanes for a value func-
tion (using a reward formulation for visual clarity) for X =
{‘military’,‘truck’,‘car’,‘empty’}, S = 1, M = 3 for a horizon 3 problem.
The cost coefficients for the non-military vehicles were added together to
create the 3D plot. This figure and Fig. A·4 are a mixed-strategy pair.
Figure A·4: Example of 3D hyperplanes representing the optimal value
function returned by Value Iteration. The optimal value is the convex hull
of these hyperplanes. This figure and Fig. A·3 are a mixed-strategy pair
(see Section 2.3).
The optimal value function provides (expected) cost-to-go information for every possible
belief-state and thus provides all the information necessary to select an optimal action
given a particular belief-state.
A.2 Point-Based Value Iteration
Pineau recently developed a POMDP algorithm called Point-Based Value Iteration or
PBVI that uses sampling to generate near-optimal policies [Pineau et al., 2003]. PBVI
samples belief-space and maintains a record of the hyperplane with the best value (i.e.,
the best-available action) for every belief-point (sample point). The difference between
Finite Grid Methods and PBVI is that the former only keeps track of the best value at a
belief point whereas the latter keeps track of the best hyperplane, which is enough to be
able to reconstruct an approximation of the optimal value function in the neighborhood
of the belief-point. When the belief-space can be sampled densely, Finite Grid Methods
and PBVI give very good (or perfect) solutions; however, this is intractable in high-
dimensional belief-spaces. In [Lovejoy, 1991a, Lovejoy, 1991b] a Finite Grid Method is
proposed in which a Freudenthal triangulation is used to tessellate the state-space of the
underlying MDP which gives (M +n−1)!/(M!(n−1)!) possible belief-points where n is
the number of MDP states and M is the number of samples in each dimension. In gen-
eral, the PBVI technique scales much better complexity-wise than the other techniques
for solving POMDPs and can be used to solve much larger problems near-optimally.
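For concreteness, the grid-point count quoted above, (M + n − 1)!/(M!(n − 1)!), is simply the binomial coefficient C(M + n − 1, M); a tiny helper (ours, purely illustrative) for evaluating it:

// Number of Freudenthal grid points: C(M + n - 1, M), computed iteratively so
// every intermediate result stays an exact integer.
#include <cstdint>

std::uint64_t gridPoints(std::uint64_t M, std::uint64_t n) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= M; ++i)
        result = result * (n - 1 + i) / i;    // result == C(n - 1 + i, i)
    return result;
}
// e.g. gridPoints(10, 4) == 286 belief-points for a 4-state MDP with M = 10.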
Assume that the set B is a finite set of belief-points for PBVI. There will be one
(optimal) α-vector computed for every belief-point in B. Using the PBVI algorithm,
the approximate value function backup operation V = \tilde{H}V' can be performed with the following steps:

1.
   \Gamma^{u,*} \leftarrow \alpha^{u,*}(x) = R(x, u) \qquad (A.6)

2.
   \Gamma^{u,o} \leftarrow \alpha_i^{u,o}(x) = \gamma \sum_{x' \in X} T(x, u, x')\, Y(o, x', u)\, \alpha'_i(x'), \quad \forall\, \alpha'_i \in V' \qquad (A.7)

3. Using the finite set of belief-points B, the cross-sum step of Eq. A.4 is much simpler:

   \Gamma^u_b = \Gamma^{u,*} + \sum_{o \in O} \arg\max_{\alpha \in \Gamma^{u,o}} (\alpha \cdot b), \quad \forall\, b \in B,\ \forall\, u \in U \qquad (A.8)

4. The last step is to find the best action for each belief-point:

   V \leftarrow \arg\max_{\Gamma^u_b,\ \forall\, u \in U} (\Gamma^u_b \cdot b), \quad \forall\, b \in B \qquad (A.9)
As is the case with the exact value backup in Eq. A.2 - Eq. A.5, the PBVI routine creates |U||O||V'| projections. However, the support for the value function V is limited in the number of hyperplanes it can have to the size of |B|, with a computational time complexity on the order of |X||U||V'||O||B|. Even more importantly, the number of hyperplanes does not "blow up" from one stage to the next as it does with the exact backup operation. Pineau gives more details on this derivation in [Pineau et al., 2003].
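The corresponding point-based backup can be sketched as follows (our illustrative code with an assumed model layout, written in the reward/maximization form of Eq. A.6-A.9; only one alpha-vector is kept per belief-point, so |V| never exceeds |B|):

// Sketch of a PBVI backup: shared projections, then a per-belief-point argmax.
#include <cstddef>
#include <limits>
#include <vector>

using Alpha = std::vector<double>;
using AlphaSet = std::vector<Alpha>;
using Belief = std::vector<double>;

static double dot(const Alpha& a, const Belief& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// R[u][x], T[u][x][x'], Y[u][o][x'] as in the exact-backup sketch above.
AlphaSet pbviBackup(const std::vector<std::vector<double>>& R,
                    const std::vector<std::vector<std::vector<double>>>& T,
                    const std::vector<std::vector<std::vector<double>>>& Y,
                    const AlphaSet& Vprev, const std::vector<Belief>& B,
                    double gamma = 1.0) {
    const std::size_t nU = R.size(), nX = R[0].size(), nO = Y[0].size();

    // Steps 1-2: projections Gamma^{u,o}, shared by all belief-points.
    std::vector<std::vector<AlphaSet>> G(nU, std::vector<AlphaSet>(nO));
    for (std::size_t u = 0; u < nU; ++u)
        for (std::size_t o = 0; o < nO; ++o)
            for (const Alpha& ap : Vprev) {
                Alpha a(nX, 0.0);
                for (std::size_t x = 0; x < nX; ++x)
                    for (std::size_t xp = 0; xp < nX; ++xp)
                        a[x] += gamma * T[u][x][xp] * Y[u][o][xp] * ap[xp];
                G[u][o].push_back(a);
            }

    AlphaSet V;
    for (const Belief& b : B) {
        Alpha best;
        double bestVal = -std::numeric_limits<double>::infinity();
        for (std::size_t u = 0; u < nU; ++u) {
            // Step 3: Gamma^u_b = Gamma^{u,*} + sum_o argmax_{a in G[u][o]} a.b
            Alpha cand = R[u];
            for (std::size_t o = 0; o < nO; ++o) {
                const Alpha* amax = nullptr;
                for (const Alpha& a : G[u][o])
                    if (!amax || dot(a, b) > dot(*amax, b)) amax = &a;
                if (amax)
                    for (std::size_t x = 0; x < nX; ++x) cand[x] += (*amax)[x];
            }
            // Step 4: keep the best action's vector for this belief-point.
            if (dot(cand, b) > bestVal) { bestVal = dot(cand, b); best = cand; }
        }
        V.push_back(best);
    }
    return V;
}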
A.3 Dantzig-Wolfe Decomposition and Column Generation for
LPs
In our work, we use decomposition techniques to break multi-location problems into
single location problems, coordinated by a master problem. This approach is known
as a Dantzig-Wolfe decomposition. Consider the following LP from [Bertsimas and
Tsitsiklis, 1997]:
\min\; c_1^T x_1 + c_2^T x_2 \qquad (A.10)
\text{subject to}\quad D_1 x_1 + D_2 x_2 = b_0
\qquad\qquad\qquad F_1 x_1 = b_1
\qquad\qquad\qquad F_2 x_2 = b_2
where x1 ≥ 0 and x2 ≥ 0 and Fi are linear constraints that specify a polyhedral set
of feasible points. The latter two constraints in Eq. A.10 are not coupling constraints,
but the first constraint (with the Di matrices) couples the optimal values of x1 and x2
together. Define Pi as the polyhedra describing the set of all xi such that Fi xi = bi for
i ∈ {1, 2}. We can rewrite Eq. A.10 as:
\min\; c_1^T x_1 + c_2^T x_2 \qquad (A.11)
\text{subject to}\quad D_1 x_1 + D_2 x_2 = b_0
with x1 ∈ P1 and x2 ∈ P2. Now using the Resolution Theorem for Convex Polyhedra, the
variables x1 and x2 can be written in terms of a basis of extreme points and extreme
rays. Assume there are J_i extreme points and K_i extreme rays in the i-th polyhedron. Let the vectors x_i^j for j ∈ J_i represent the extreme points of the polyhedron P_i. The vectors w_i^k represent the extreme rays of the polyhedron P_i for k ∈ K_i. Obviously, for bounded polyhedra the number of extreme rays is 0. The variables x_i can now be written in the form:
x_i = \sum_{j \in J_i} \lambda_i^j x_i^j + \sum_{k \in K_i} \theta_i^k w_i^k
with the bounds λ_i^j ≥ 0 ∀ i, j and θ_i^k ≥ 0 ∀ i, k and with a simplex constraint on the λ_i^j values (we only want to allow convex combinations of the extreme points):

\sum_{j \in J_i} \lambda_i^j = 1 \quad \forall\, i \in \{1, 2\}
This substitution results in the constraints:

\sum_{j \in J_1} \lambda_1^j \begin{bmatrix} D_1 x_1^j \\ 1 \\ 0 \end{bmatrix}
+ \sum_{j \in J_2} \lambda_2^j \begin{bmatrix} D_2 x_2^j \\ 0 \\ 1 \end{bmatrix}
+ \sum_{k \in K_1} \theta_1^k \begin{bmatrix} D_1 w_1^k \\ 0 \\ 0 \end{bmatrix}
+ \sum_{k \in K_2} \theta_2^k \begin{bmatrix} D_2 w_2^k \\ 0 \\ 0 \end{bmatrix}
= \begin{bmatrix} b_0 \\ 1 \\ 1 \end{bmatrix} \qquad (A.12)
where the optimization variables are now in terms of λ_i^j and θ_i^k. For a general LP of the form min c^T x subject to Ax = b, the reduced cost of variable x_i is written as c_i − p^T A_i, where p is a Lagrange multiplier vector (dual variable). Therefore, relative to this new problem structure, the reduced costs for λ_i^j can be written as:

c_i^T x_i^j - \begin{bmatrix} q^T & r_{i1} & r_{i2} \end{bmatrix} \begin{bmatrix} D_i x_i^j \\ 1 \\ 0 \end{bmatrix}
= \left( c_i^T - q^T D_i \right) x_i^j - r_{i1} \qquad (A.13)
and the reduced costs for θ_i^k are:

c_i^T w_i^k - \begin{bmatrix} q^T & r_{i1} & r_{i2} \end{bmatrix} \begin{bmatrix} D_i w_i^k \\ 0 \\ 0 \end{bmatrix}
= \left( c_i^T - q^T D_i \right) w_i^k \qquad (A.14)
where the vector p^T = [ q^T  r_{i1}  r_{i2} ] represents an augmented price vector (Lagrange
multiplier) that gives the price of violating constraints in this primal problem. The
Revised Simplex Method naturally supplies this type of pricing information as part of
its solution procedure, so no extra calculation is necessary.
Now, rather than trying to enumerate all of the reduced costs for the possibly very large number of λ_i^j and θ_i^k variables, the best reduced cost (corresponding to whichever non-basic variable would be most valuable to have in the basis) can be found by solving the auxiliary (related) LPs:

\min\; \left( c_i^T - q^T D_i \right) x_i \qquad (A.15)
\text{subject to}\quad x_i \in P_i
for each of the subproblems. Using a Column Generation procedure, if the LP solution for subproblem i has an optimal cost that is smaller than r_{i1} and finite, then we have identified an extreme point x_i^j that implies the reduced cost of λ_i^j is negative. Therefore a new column [ D_i x_i^j  1  0 ]^T for the variable λ_i^j is generated and added to the master problem. If the LP solution for subproblem i is unbounded, we have identified an extreme ray w_i^k that implies the reduced cost of θ_i^k is negative. Therefore a new column [ D_i w_i^k  0  0 ]^T for the variable θ_i^k is generated and added to the master problem. If the optimal cost corresponding to the i-th LP is no smaller than r_{i1} ∀ i, then an optimal solution to the original (master) problem has been found. Clearly, there is nothing limiting this formulation to just two subproblems; the only limit to the number of subproblems is the available amount of computing time.
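As a small numerical illustration of this pricing test (our code and naming, following Eq. A.13/A.15): given the dual prices (q, r_i1) reported by the restricted master problem, a candidate extreme point x of subproblem i is only worth adding when its reduced cost is negative.

// Reduced cost of a candidate column: (c_i - D_i^T q)^T x - r_i1.
#include <cstddef>
#include <vector>

double reducedCost(const std::vector<double>& c_i,
                   const std::vector<std::vector<double>>& D_i,  // coupling rows
                   const std::vector<double>& q,                 // dual prices
                   double r_i1,
                   const std::vector<double>& x) {               // extreme point
    double cost = -r_i1;
    for (std::size_t j = 0; j < x.size(); ++j) {
        double adjusted = c_i[j];
        for (std::size_t m = 0; m < q.size(); ++m)
            adjusted -= q[m] * D_i[m][j];     // subtract column j of q^T D_i
        cost += adjusted * x[j];
    }
    return cost;
}

bool shouldAddColumn(double rc, double tol = 1e-9) {
    return rc < -tol;     // negative reduced cost => an improving column exists
}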
The application of a Dantzig-Wolfe Decomposition to a linear programming problem
results in the method of Column Generation. This technique can be used to solve large
systems of linear equations containing so many variables and constraints that they do
not fit inside computer memory (or in some cases can not even be enumerated). Consider
a system of equations:
\min\; c^T x
\text{subject to}\quad Ax = b
\qquad\qquad\qquad x \geq 0 \qquad (A.16)
where x ∈ ℜ^n, c ∈ ℜ^n, b ∈ ℜ^m, and the matrix A is m × n with entries a_ij ∈ ℜ. Let A_i denote the i-th column of A. In situations where the matrix A is so large that it may not be
feasible to evaluate the product Ax, it is still possible to build up to the optimal solution of c^T x iteratively. Let us assume that m ≪ n. An iterative solution may be constructed by using the Revised Simplex Method, which just keeps track of the columns A_i of A that are basic (that have support) along with the corresponding values x_i. The basis is initialized either by using m artificial variables (with large cost coefficients so they will be driven out of the basis) or with another heuristic method. With every iteration of this procedure, one column is added to the basis (hence the name "Column Generation"), so the basis keeps growing in dimension. Let I_k represent an index set of column indices of A in the basis up to (but not including) the k-th iteration of Column Generation. On
iteration k, Column Generation solves a Restricted Master Problem:
\min\; c_{I_k}^T x_{I_k} \qquad (A.17)
\text{subject to}\quad A_{I_k} x_{I_k} = b
\qquad\qquad\qquad x_{I_k} \geq 0
according to the basic variables x_i ∀ i ∈ I_k, and then determines whether or not there are any negative reduced-costs associated with the remaining (non-basic) columns. (A negative reduced-cost indicates the solution is not yet optimal.) Each time a negative reduced-cost is found, k is incremented by 1 and a new basic variable (for the column that had the negative reduced-cost) is added to the set I_k. If no negative reduced-costs are found, an optimal solution to the original LP Eq. A.16 has been found. Therefore
the Column Generation method iteratively builds up a solution in a larger and larger
subspace of the original space ℜn
until an optimal solution is found and this can be
done even when n = ∞! It is, of course, necessary that the subproblems can be solved
in a reasonable amount of time or else iteratively solving subproblems in this fashion is
not helpful. In situations where there are many constraints and a tractable number of
variables, the Cutting Plane Method can be applied to the dual problem. See [Bertsimas
and Tsitsiklis, 1997] for the full derivation of this material. Williams demonstrates how
Column Generation can be applied to an SM problem in his dissertation [Williams, 2007].
The dissertation [Tebboth, 2001] describes Dantzig-Wolfe Decompositions in more detail
including how columns can be generated in parallel to speed up the solution process.
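Written out in the notation above, each pass of this procedure prices the columns that
are not yet in I_k using the dual vector p obtained from the Restricted Master Problem
(A.17):

\[
\bar{c}_j \;=\; c_j - p^T A_j , \qquad j \notin I_k ,
\]

and if some \bar{c}_j < 0 the corresponding column is brought in, I_{k+1} = I_k \cup \{ j \};
when no such j exists, the solution of the Restricted Master Problem is optimal for
Eq. A.16.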
Appendix B
Documentation for column_gen Simulator
The purpose of this appendix is to give a high-level overview of the function of the
column_gen simulator, such that a third party could pick up the program and use it for
their own simulations and/or continue to develop the code-base. The style of this appendix
is less formal than the rest of the dissertation. This program was built off of the baseline
established by Anthony R. (Tony) Cassandra during his dissertation work at
Brown University. His program pomdp-solve-5.3 provides much of the core functionality
of this simulator [Cassandra, 1999]. Nonetheless, I spent several years working with his
code base, developing and customizing it, and in general had to make it work for me.
There were something like 60–80K lines of code in Tony’s program before I got to it,
and I added around 10–15K; a significant portion of the work went to understanding
and rearranging what was already there. Whereas Tony’s interests were typically in
batch-mode execution of a POMDP solver using an infinite-horizon problem formulation,
the concerns of this dissertation necessitated a program that could execute many
POMDPs in a loop (unimpeded by file I/O). In addition, these POMDPs needed to have
customizable parameters from one iteration to the next, for such things as variable
sensor resource costs. Later, after beginning to work with “visibility groups”, the
POMDP Subproblems also needed to support a separate action-space for each
Subproblem, according to the particular set of sensors that had visibility for that
Subproblem. Breaking Tony’s batch-mode program into a more modular form able to
run in a loop, learning what did or did not need to be reinitialized, learning how to
create my own data structures following his conventions, etc., took significant (3+
person-months of) effort before starting on the Column Generation algorithm. There
were also a couple of bugs that came up while using a finite-horizon POMDP model
that had to be fixed (they were of course not relevant to the infinite-horizon problem
and were therefore able to escape his notice). The simulator portion of the program is
C++ code, while the planning code and the interface with pomdp-solve-5.3 are written
in C; ‘extern “C” {}’ statements are used to allow this to work. Unfortunately, I found
that Tony’s POMDP solver did not work correctly when solving models using a cost
formulation. Therefore I solved all POMDPs in this program using a reward formulation
and had to convert at the interface between my code and his.
B.1 Build Environment
In its current form, the column_gen program has 3 different use-cases that I have been
switching between by using a couple of “#if 1 (or 0)” preprocessor directives in the
main.cpp file. This is rather primitive, but it was intended as a temporary solution
while evaluating what the final use-cases for the program will be. The “#if 1 (or 0)”
preprocessor flag in main.cpp, line 145, controls whether the C++ simulator or the
ColumnGenDriver() C-language subroutine will be executed. The driver basically just
runs the Column Generation planning algorithm for a series of different inputs and is
useful for creating graphs of the lower bound as a function of various input values. The
“#if 1 (or 0)” flag in main.cpp, line 161, controls whether the full range of simulations
will be run for all the different combinations of resource levels, horizons, MD-to-FA
ratios and simulation modes (the default case, which currently entails 3^4 · 100 simulation
runs) or whether just one batch of 100 simulations will be run. I have used the latter
case (in conjunction with appropriate values of the random number generator seeds (see
below) and appropriate values for each of these design parameters) to jump-start the
simulator in a particular state wherein it was crashing, so I could debug the problem.
The IDE “KDevelop ver 3.5.3”, running under “KDE ver 3.5.10”, was used to develop
this software. The parameters used to configure the project from within
KDevelop->Project->Project Options are the following (for a debug configuration):
• Configure Options (General):
– Configure arguments: --enable-debug=full (this parameter generates a warning
with Autoconf and needs attention)
– Build directory: debug
– Top source directory: (blank)
– C/C++ preprocessor flags (CPPFLAGS): -D DEBUG
– Linker flags (LDFLAGS): (blank)
• Configure Options (C):
– C compiler: GNU C Compiler
– Compiler command (CC): gcc
– Compiler flags (CFLAGS): -O0 -g3 -L../lpsolve55
• Configure Options (C++):
– C++ compiler: GNU C++ Compiler
– Compiler command (CXX): g++
– Compiler flags (CXXFLAGS): (blank)
• Run Options (“Main Program” check-box is checked):
134
– Executable: column_gen/debug
– Run Arguments: (as given in the previous paragraph)
– Debug Arguments: (as given in the previous paragraph but without the
redirection-to-file operator)
– Working Directory: column_gen/debug/src
• Debug Options:
– Debugger executable: /usr/bin/
– Debugging shell: libtool
– Options: Display static members, Display demangled names, Try setting
breakpoints on library loading
– Start Debugger With: Framestack
This project currently uses the Automake and Autoconf build tools for better or worse.
In order to add new files to the project the Makefile.am files must be edited. These files
get compiled into Makefile.in files which eventually get turned into Makefiles. The files
“Makefile” themselves are temporary/expendable in nature and should not be edited.
There was a problem with the ltmain.sh script that is used to configure the project: it
seems to refer to the wrong version of the “libtool” script in a system folder, and in
consequence I originally had problems getting this project to compile on a computer
running Linux (Ubuntu 9.04). After determining that the build errors had something
to do with improper (inconsistent) versions of the different Autoconf and Automake
tools being run, I manually replaced the ltmain.sh script with one I had from the BU
Stormy (Fedora Core 4) machine. This was an ugly fix, but it solved the problem. The
whole build process needs to be reworked, and preferably moved away from Autoconf
and Automake; alternatively, someone more familiar with these tools could get in and
get them working nicely together.
I have not been using the Build->Install functionality from within KDevelop, just
Build->run automake & friends followed by Build->Build Project. Occasionally I have
had problems with the KDevelop environment getting stuck in some kind of intermediate
state with the Automake tools; in those cases I have run “make distclean” from the
command line (from the main project directory) to get rid of the old Makefiles and cached
build information.
Then I have run the Automake tools again and rebuilt the project with the new Makefile.
The Automake tools actually create Makefiles out of Makefiles!
For the record, the main files that I created + frequently used while working on this
project are the following:
• main.cpp
• simulator.cpp/.hpp
• vehicle.cpp/.hpp
• task.cpp/.hpp
• grid.cpp/.hpp
• cell.cpp/.hpp
• column_gen.c/.h
• pomdp_solve_5_3.c/.h
• global.h
• MyMacros.h
These files are found in the /src directory of the project. I also worked with the files:
• pg.c/.h
• pomdp.c/.h
136
• alpha.c/.h
• mdp.c/.h
• imm-reward.c/.h
some of which are in the same /src directory and some of which are in the /src/mdp
subdirectory. Other files were modified only on rare occasions. By convention, C language
files were given the suffixes .c and .h, and C++ language files were given the suffixes .cpp
and .hpp.
The following MATLAB scripts are helpful for either displaying simulation results
or debugging such things as the evolution of a belief on a decision-tree:
• DisplayPolicyGraph.m: uses MATLAB’s biograph viewer to display a color-coded
decision-tree
• ReadHyperplanesFromFile.m: helper file for DisplayPolicyGraph.m
• ReadPolicyGraphLine.m: helper file for DisplayPolicyGraph.m
• PlotLowerBoundPaperResults.m: plots the ROC curve of Fig. 2·4
• PlotValueFunction3D.m: plots the 3D value functions used in Appendix A.1
• PlotValueFunction.m: plots the 2D value functions such as in Fig. A·1
• BeliefEvolution_LowerBoundPaper.m: belief-evolution test-case relevant to the
simulation results in [Castañón, 2005a]
• test_case_J_measure_J_total_mismatch.m: belief-evolution test-case of Fig. 2·6
• pg_tree_y1_y2_calcs.m: belief-evolution test-case related to Fig. A·1
B.2 Running column_gen
In this project the source files are all in the column_gen/src subdirectory and its
subdirectories. There is a limited amount of documentation in the column_gen/docs
subdirectory, and all of the data (as well as figures, some MATLAB scripts, etc.) is stored
in the column_gen/sensor_management subdirectory.
The column_gen program can be executed with one of the following commands (the
paths in these commands assume the program is launched from the column_gen/debug/src
directory):
1. Search and Exploit Variation:
./column_gen -dom_check true -method grid -fg_type search -fg_purge domonly
-proj_purge domonly -fg_epsilon 1.0e-9 -fg_points 1000 -scenario_filename
../../sensor_management/searchAndExploit_ver3.data -pomdp
../../sensor_management/searchAndExploit_ver3.POMDP
> ../../sensor_management/simulatorOutputFileSearchAndExploit.txt
2. Lower Bound Paper Variation:
./column_gen -dom_check true -method grid -fg_type search -fg_purge domonly
-proj_purge domonly -fg_epsilon 1.0e-9 -fg_points 1000 -scenario_filename
../../sensor_management/lwrBoundPaperColumnGen_ver3.data -pomdp
../../sensor_management/lwrBoundPaper_ver3.POMDP
> ../../sensor_management/simulatorOutputFileLowerBoundPaper.txt
Most of these parameters are passed on to the underlying pomdp-solve-5.3 POMDP
solver code. I added the command-line argument -scenario_filename, which allows
me to specify my own file of simulation parameters for running the C++ simulator
code. Most of the parameters to the POMDP solver control how it prunes hyperplanes
and specify the use of a Finite Grid (PBVI-type) algorithm versus one of the other 4
supported algorithms. To date I have only used the Finite Grid or Witness algorithm
variations. The one interesting parameter of note here is -fg_points, which specifies the
use of 1000 belief-points in this case. See Tony’s documentation for more details about
the use of these parameters.
138
B.3 Outputs of column_gen
When the simulator is running simulations (rather than stopping short after using the
ColumnGen() function to generate sensor plans or lower bounds), as determined by the
two preprocessor flags mentioned in Section B.1, the column_gen program creates two
output files. The first file is a (generally verbose) log of output that is redirected to file
according to the filename given as the last argument in the run command (the last
section used the example filename “simulatorOutputFileLowerBoundPaper.txt”). The
second filename is generated programmatically according to the current set of simulation
parameters and is a comma-separated (csv) file that contains all of the per-simulation-batch
statistics reported by the RunSimulationConfiguration() function. The verbosity of the
log-file can be controlled by setting the TraceFlags global variable (more details follow
in the next section). I typically import the csv output-file into a spreadsheet program,
mark up the columns and analyze the performance in that context. From there, numerical
information can also be exported to MATLAB (or MATLAB can be used to fscanf() the
fields in from the csv output-file), and calculations or plots can be done.
B.4 Program Conventions
To begin the discussion, several conventions that were followed while working on this
program are worth mentioning. Following these conventions improved program clarity
and lessened the opportunity for confusing one context of the program with another
while working across multiple files.
On occasion the C++ code has “call by reference” arguments, and fairly frequently
the C code returns multiple results via the pointer mechanism. I tried to indicate that
this was happening by putting a comment at the end of the function call, e.g. “// =>obs”,
to indicate that obs was being returned by reference or by pointer. I made extensive use
of dynamic memory allocation, but stuck with Tony’s versions of the malloc calls:
XMALLOC(), XFREE() and so on. When I had a dynamically allocated variable, that
variable was passed using pointer notation (not array notation); however, I attempted to
indicate the sizes of all dynamically allocated structures using comments of the form
“pArray/*[2][3]*/”, meaning that although “pArray” may have been a pointer of type
“double **”, it had been allocated to store a matrix with 2 rows and 3 columns. This type
of notation helped significantly in keeping the different contexts I was working in from
getting confused with one another.
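The following small, self-contained C example (hypothetical code, not taken from the
actual code-base) illustrates both conventions: the “// =>obs” style comment marking an
output argument, and the “/*[2][3]*/” style comment recording the dimensions behind a
plain pointer:

#include <stdio.h>
#include <stdlib.h>

/* obs is an output argument; the dummy value is for illustration only */
static void getObservation(int sensor, int *obs)
{
    *obs = sensor + 1;
}

int main(void)
{
    int obs;
    int r, c;
    double **pArray/*[2][3]*/;                 /* pointer type, but sized 2 x 3 */

    pArray = malloc(2 * sizeof(double *));
    for (r = 0; r < 2; r++)
        pArray[r] = malloc(3 * sizeof(double));

    getObservation(0, &obs);                   /* =>obs */

    for (r = 0; r < 2; r++)
        for (c = 0; c < 3; c++)
            pArray[r][c] = (double) obs;

    printf("obs = %d, pArray[1][2] = %g\n", obs, pArray[1][2]);

    for (r = 0; r < 2; r++)
        free(pArray[r]);
    free(pArray);
    return 0;
}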
I wrote this program assuming that one vehicle would have multiple sensors under
its direction and that eventually the sensors would be constrained to moving together
(and only looking at things within a certain range of the vehicle). This is still work
in progress. Currently there are no constraints on what locations/objects a vehicle’s
sensors can look at (other than the 0/1 type visibility constraints). Whereas I would
have preferred to set a sensor-centric limit on the locations each sensor can look at (like
in the CVehicle (aka sensor-platform) class), the ColumnGen() planning sub-routine was
written well before knowing it would be used in this way. Therefore it was easier to store
information in the TaskList[] array (an output of and also an input to the ColumnGen()
function) to specify which sensors can look at which locations. The TaskList[] global
array-variable has location-by-location information stored in it during the ColumnGen()
function’s execution (more to follow). At this point in time, the CVehicle class in the
simulator is just a container for all the sensors in the program, and no code concerning
the positions of sensors (or locations of cells) is active/useful. The global array TaskList[]
ought to be a local variable, an argument to the ColumnGen() function, and be passed
down through the call-chain. This would fix several different issues, but is still work
in progress. As it stands, there is just one vehicle, and there is no relationship between,
and no constraints on, the activities of the sensors which it holds. The only issue is that
the simulation calls pVehicle->update() once per update cycle (where pVehicle is actually
a Standard Template Library (STL) iterator), which causes one sensing task in the
vehicle’s task-list (CVehicle::m_taskList) to be undertaken. A more realistic account of
time might entail updating each “vehicle” once for each sensor it contains, or else creating
multiple vehicles that are each limited to containing one sensor. Again, this is work in
progress. The vehicles maintain vector-resource information and vector-constraints on
expected resource expenditure and handle multiple tasks for multiple sensors
appropriately. For better or worse, they process the tasks for multiple sensors serially;
if it mattered, this could easily be changed. Another reason that the TaskList[] variable
is suboptimal is that it is very similar in name to the CVehicle::m_taskList variable, and
the two variables have no relation to each other. Additionally, if the Column Generation
code was not dependent on the TaskList variable, then it would be possible to run
multiple CSimulator objects in parallel (i.e. to support multiple concurrent simulations).
I attempt to assert, more or less exhaustively, every condition I can at the beginning
of function calls, and elsewhere as well. These assertions have the form ‘Assert(x > 0,
“Something is broken, x <= 0”);’. The string in the second argument of the “Assert()”
function is displayed when the assertion fails (and the program terminates at that point).
While there were many false alarms that had to be dealt with in working with these
assertions (getting them straight and self-consistent), they were nonetheless extremely
useful in assuring a consistent and valid program state. In addition, this assertion
function (at least under Linux) gives a file name and line number when it fails, which
helps the debugging process quite a bit. In developing code, the sooner a problem can
be detected after it occurs, the easier it is to handle and fix. So as not to slow the program
down after it has been debugged, the “Assert()” function can be “#define’d” away to a
null function, or else a preprocessor directive can be used to comment out the body of
the function. The one issue is that any assertion of the form ‘Assert(0, “”);’ should not
really be an assertion; it should be an “exit(-1);”-type statement. Actually, Tony wrote
an Abort() function that provides a parameter for a textual description of the error
condition; I should have been using that function instead of ‘Assert(0, “”);’. After the
program is stable (and I believe it is at least fairly stable and well-debugged as is), the
Assert() statements can be turned off, but the Abort() and exit()-like statements should
remain. At that point it would make sense to start turning on compiler optimizations
and tuning the algorithm for speed using performance profiling.
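A minimal sketch of an Assert() macro in this style is shown below (the macro actually
used in the code-base may differ in its details, and the DISABLE_ASSERTS switch is
purely illustrative); it reports the failing condition, the description string, and the file
name and line number, and it can be compiled away once the program is considered
stable:

#include <stdio.h>
#include <stdlib.h>

#ifdef DISABLE_ASSERTS
#define Assert(cond, msg) ((void) 0)
#else
#define Assert(cond, msg)                                               \
    do {                                                                \
        if (!(cond)) {                                                  \
            fprintf(stderr, "Assertion failed: %s\n  %s\n  at %s:%d\n", \
                    #cond, (msg), __FILE__, __LINE__);                  \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)
#endif

int main(void)
{
    int x = 1;
    Assert(x > 0, "Something is broken, x <= 0");   /* passes; execution continues */
    printf("x = %d\n", x);
    return 0;
}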
I also #define’d a series of “VALID_[something]()” macros in global.h that are used
to test the validity of the range of some variable in a consistent fashion. Consistency
is everything. These macros also reveal a lot of information about the conventions used
in this program, and so they are a very good means of studying how the various variables
are used and what values are allowable for, e.g., states, observations or sensor indices.
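As an illustration, a range-check macro in this spirit might look like the following (the
macro name and the bound used here are hypothetical, not the actual definitions from
global.h):

#include <stdio.h>

int gNumStates = 4;   /* illustrative value; in the real program this is set from the .POMDP file */

#define VALID_STATE(s)  ((s) >= 0 && (s) < gNumStates)

int main(void)
{
    printf("state 2 valid? %s\n", VALID_STATE(2) ? "yes" : "no");
    printf("state 7 valid? %s\n", VALID_STATE(7) ? "yes" : "no");
    return 0;
}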
As much as possible, whenever I changed Tony’s code, I attempted to document
those changes with comments of the form /* Darin’s Modification [date] – description */.
This worked well when I was making precise changes to his code, but was messier at the
interface between my code and his. I think at some point with the column_gen.c and
pomdp_solve_5_3.c files I quit bothering; those files are basically of my authorship now.
There are numerous different kinds of “0” in programming, and I tried to disambiguate
between them to make the code clearer. First and foremost, I used “0” for ints (integers)
and “0.0” for floats and doubles. And while “NULL” equates to “0” in string and pointer
comparisons, I used “NULL” anyway for clarity. The same goes for using “FALSE” and
“TRUE” in C code (or “false” and “true” in C++ code) instead of merely 0 and 1.
I have used the tag “WARN”, short for “WARNING”, in places where a piece of code
is problematic, prone to failure or otherwise in need of attention. Test code is frequently
commented in or out using “#if 1 (or 0)” preprocessor statements. Any code that was
completely temporary (and not supposed to remain in the program) was generally set
off in a block of comments (such as // ... //) and labeled with the comment “remove
on sight”.
I frequently used Hungarian notation for variables, so “bReadyToReplan” is by
convention a binary-type variable. A variable “nSizeX” would be of integer type (one of
the variants) and “fLength” would be a floating-point value (type float). Member variables
of classes were prefixed with “m_”, so “m_bInit” might be used for a true-false member
variable. In general I tried to use a convention such as “policyGraph” rather than
“policy_graph” for naming variables, but was not entirely consistent; this deserves
remediation. I also wanted to choose a naming convention (and coding/indentation style)
different from Tony’s, to make it easier to delineate the boundaries between the code we
have each written.
The lp_solve-5.5 program has the habit of sticking things in the 0th position in an
array or matrix and then indexing that array or matrix with a 1-based notation (like
MATLAB). One example of this is that the objective function coefficients are stored in
row 0 of the constraint matrix and the actual constraints start at row 1. Similarly,
when requesting the solution outputs from lp_solve-5.5, the primal variable values, dual
variable values and (I believe) the objective function value are all lumped into the same
array and have to be indexed appropriately.
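The toy example below (a made-up two-variable LP, not one of the LPs used by
column_gen) sketches this 1-based convention in the lp_solve-5.5 C API: element [0] of
each row array passed to set_obj_fn() and add_constraint() is ignored and coefficients
start at index 1, while the array filled by get_variables() is indexed from 0:

#include <stdio.h>
#include "lp_lib.h"

int main(void)
{
    /* minimize 2*x1 + 3*x2  subject to  x1 + x2 >= 4,  x1, x2 >= 0 */
    lprec *lp = make_lp(0, 2);                /* start with 0 rows, 2 columns */
    REAL obj[3]  = { 0.0, 2.0, 3.0 };         /* obj[0] is unused             */
    REAL row1[3] = { 0.0, 1.0, 1.0 };         /* row1[0] is unused            */
    REAL x[2];

    set_obj_fn(lp, obj);
    add_constraint(lp, row1, GE, 4.0);
    set_minim(lp);

    if (solve(lp) == 0) {                     /* 0 indicates an optimal solution */
        get_variables(lp, x);                 /* x[0], x[1] are 0-based          */
        printf("objective = %g, x1 = %g, x2 = %g\n",
               get_objective(lp), x[0], x[1]);
    }
    delete_lp(lp);
    return 0;
}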
Originally it was necessary to specify a .POMDP file that defined a POMDP model
and an .alpha file to specify the “terminal values” that were used to initialize the
cost function (for a finite-horizon POMDP). It took me a while to figure out that the
“terminal values” argument is actually for initializing the solvePomdp() function, which
is at the core of Tony’s POMDP solver. The .POMDP file structure is quite general and
flexible, but the .alpha file structure is much the opposite. Therefore, I eventually got
away from using .alpha files and defined my own value-function initialization parameters
(i.e. FA and MD hyperplanes) programmatically.
Currently, a pair of input files is used to run a simulation. Consistency must be
maintained between the simulation .data file and the .POMDP file; they are coupled in
several ways (w.r.t. the dimensions of the state and action spaces). I am currently using
“ver3” of these files. The .POMDP file is still used to define the POMDP model
parameters (states, actions, observations and immediate rewards). The immediate
rewards are actually overwritten later programmatically, but it is still important to
specify dummy reward values in the .POMDP file or else storage will not be allocated,
seeing as the immediate rewards are stored in a sparse representation (0-valued immediate
rewards are not stored, and the immediate reward of an action is inferred to be 0 if no
reward is found for that action in the sparse representation).
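The sketch below (hypothetical structures, not Tony’s actual data types) illustrates why
the dummy rewards are needed: a sparse reward table only contains the entries that were
explicitly given, and any (action, state) pair that is absent simply reads back as 0:

#include <stdio.h>

typedef struct { int action; int state; double reward; } SparseReward;

/* Return the stored reward, or 0.0 when no entry exists for (action, state). */
static double lookupReward(const SparseReward *table, int n, int action, int state)
{
    int i;
    for (i = 0; i < n; i++)
        if (table[i].action == action && table[i].state == state)
            return table[i].reward;
    return 0.0;
}

int main(void)
{
    SparseReward table[] = { { 0, 1, -2.5 }, { 1, 0, -1.0 } };
    printf("R(a=0, s=1) = %g\n", lookupReward(table, 2, 0, 1));
    printf("R(a=2, s=0) = %g  (never stored)\n", lookupReward(table, 2, 2, 0));
    return 0;
}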
B.5 Global variables in column_gen
There are actually multiple data types that Tony defined for working with immediate
rewards. He uses the global variable “gImmRewardList”, which is a linked-list whose
nodes can represent scalar values, vectors or matrices. See the functions
“updateRewards()” and “updateActionReward()” in mdp/imm-reward.c for more
information. A global variable “gProblemType” is used in Tony’s code to set the solver’s
behavior to solve either MDPs or POMDPs. The nodes in his gImmRewardList linked-list,
of type Imm_Reward_List, can have a “type” of ‘ir_value’, ‘ir_vector’ or ‘ir_matrix’,
where the type used during program execution reflects how the immediate rewards were
given in the .POMDP file.
Currently I am also programmatically generating actions (the set of feasible actions)
based on the contents of the .POMDP file. I use the actions defined in the .POMDP file
to specify a template for the actions a sensor can support. (This is a departure from
Tony’s paradigm.) The (primary) global variables gNumStates, gNumActions and
gNumObservations are set as the .POMDP file is read in. This happens very early in the
program, starting from the initPomdpSolve() function. (I broke the initialization code of
pomdp-solve-5.3 into pieces, however.) After the .POMDP file has been read in, my
simulation data input file, of type .data, is read in and then, according to the instructions
in my file, sensors (sensor actions) are instantiated based on the sensor-action templates
in the .POMDP file. Therefore I modify the value of gNumActions on the fly. I read in
and parse my simulation parameters in the function ReadSimulationData() in
column_gen.c. (More to follow.)
The following is a list of most of the significant global variables that were used within
the original pomdp-solve-5.3 program (and are mostly still in use now):
• Matrix *pomdp_P/*[TotalSensorActions]*/: POMDP model transition prob.
• Matrix *pomdp_R/*[TotalSensorActions]*/: POMDP model observation prob.
• Matrix pomdp_Q: POMDP model immediate values for state-action pairs
• I_Matrix *IP: temporary matrix of transition prob. (used while reading the .POMDP file)
• I_Matrix *IR: temporary matrix of observation prob. (used while reading the .POMDP file)
• int gNumStates: number of states in the POMDP model
• int gNumActions: number of actions in the POMDP model (I rewrite this value)
• int gNumObservations: number of observations in the POMDP model
• int *gNumPossibleObservations/*[gNumActions]*/: mainly used to ensure that every
action generates at least one observation
• int **gObservationPossible/*[gNumActions][gNumObservations]*/: controls which branches
of the decision-tree are so improbable as to not be worth “walking” (including in
expected-cost calculations); also determines the projections that are created in the
POMDP backup operation
• Imm_Reward_List gImmRewardList: linked-list that stores immediate reward values
• Problem_Type gProblemType: should be ‘POMDP_problem_type’ for our case
• double gDiscount: should be 1.0 for a finite-horizon POMDP
• Value_Type gValueType: should be ‘REWARD_value_type’; the cost-based formulation
is broken
• double gMinimumImmediateReward: equal to the most costly measurement reward;
jump-starts Value Iteration by establishing a lower bound on the costs
• double *gInitialBelief: not part of the active code-base; I specify prior probabilities in
my own input (.data) file
Tony’s code defined the sparse matrices P (transition probabilities), R (observation
probabilities) and Q (immediate rewards), which was all well and good when just one
POMDP was being solved, but after starting to work with visibility groups this was
no longer the case. Therefore I did a project-wide search-and-replace and renamed these
variables pomdp_P[], pomdp_R[] and pomdp_Q, but far more significantly, I introduced
arguments to a large portion of his hundreds of functions (across the entire code-base of
60–80K lines), so that the program is not forced to refer to pomdp_P[], pomdp_R[] and
pomdp_Q. I introduced a structure that I call “sensorPayload” that contains these values,
as well as customized versions of immRewardList, actionNames[], etc., which allows
POMDPs with different numbers of actions (or different encodings for actions) to be
solved by his code. I should have called “sensorPayload” “sensorConfiguration”; the name
is a bit of a misnomer. At any rate, I define one such structure (representing all of the
POMDP problem parameters) and pass it in to Tony’s solvePomdp() function to compute
a solution. This is the largest single change I made to his program, and it took about a
month to do this, create the programmatically defined actions and debug the results.
I also had to break certain portions of the code in solvePomdp() into pieces, moving
some of it into per-solution-call setup and shutdown code that comes before or after the
call to the solver, respectively. I also changed the initialization of the Finite Grid code so
that the belief-point grid is generated once per program run instead of once per call to
the POMDP solver. One last significant change that I made to the pomdp-solve-5.3 code
is that I updated it to use the lp_solve-5.5.0.13 solver instead of the far older and far
inferior lp_solve-2.x solver; the newer version is in a whole different class from the older
one. Nevertheless, this change only affects the Witness Algorithm code, which I stopped
using because it was still too slow. We decided that because the Finite Grid method is
already an approximate technique, solving LPs to prune hyperplanes is rather silly, so we
use simpler pruning methods. For all the advances to lp_solve, I gather it still only runs
about 5% as fast as CPLEX. Despite the fact that lp_solve-5.5.0.13 is not currently being
used to solve POMDPs, it is still useful to have the LP solver in the program when it
comes to iteratively solving the LPs used in Column Generation.
I made a major modification to the way pomdp_P[], pomdp_R[] and pomdp_Q are
created from the temporary matrices IP[] and IR[]; see the function ReadSimulationData()
for details. For how the matrices pomdp_P[], pomdp_R[] and pomdp_Q are accessed in
this program, take a look at the function showProblemMatrices(), which shows typical
usage. This is a debug function that prints the values of each of these matrices to stdout.
Tony created numerous such debug functions that are very useful for printing debug
output, such as hyperplanes and policy-graphs, to the console.
Here is a list of some of the more important global variables and arrays I have added
to this program:
• int NumTargets: number of locations
• int GridSizeX, GridSizeY: are in use but nearly useless until vehicles (sensors) have
locality constraints
• int NumTargetTypes: number of object types (including one for ‘empty’)
• int NumDecisions: number of decision/declaration hyperplanes read in from .data input
file
• int NumSensorTypes: number of templates for sensors; the .data and .POMDP file values
must agree
• int NumTargetGroups: the number of groups of targets with distinct a priori probabilities
• int NumVisibilityGroups: determined by counting how many different classes of visibility
the prior probabilities are divided into in the .data file
• unsigned int *VisibilityGroups/*[NumTargets]*/: allocated with the maximum possible
size but only using NumVisibilityGroups elements
• Target *TaskList/*[MAX_NUM_TARGETS]*/: per-location solution data after
ColumnGen() is executed; also specifies the inputs for prior probabilities and sensor
visibilities to the ColumnGen() function
• int TotalSensorCount: number of instantiations of sensors (which are programmatically
generated from the .POMDP file)
• int DistinctSensorActions: the number of (parsimonious) actions specified in the
.POMDP file
• int TotalSensorActions: the number of actions across all (programmatically generated)
sensors
• int *ParsimSensorStartIndex/*[NumSensorTypes]*/: I want to be able to find the index
of the j-th mode of sensor i when all of the sensor modes are embedded together in
one long list (stored in a sparse format), where the number of possible modes (sensor
actions) differs depending on which sensor is being used. Therefore I use the array
ParsimSensorStartIndex[] to store the index of the first mode (j = 0) for each of the
NumSensorTypes sensor templates. (This is very similar to how, for sparse matrices, the
beginning row indices are stored in the Matrix structure as row_start[row_i].) The prefix
“parsim” (for parsimonious) indicates that each (sensor, mode) combination has a unique
action index associated with it in this array. Therefore a vehicle that has multiple
instances of the same type of sensor in its sensorPayload structure will have each of its
equivalent sensors’ modes mapped to the same set of action indices when referring to the
array ParsimSensorStartIndex[]. I use this scheme so that I do not have to store action
names and relative costs for the sensor modes redundantly for each of the
duplicate/cloned/equivalent sensors that a vehicle may have. Each of the sensorPayload
structures contains a similar array called “sensorStartIndex[]”, which holds the starting
indices for each of the sensors contained in that sensor payload. In general each
sensorPayload structure has a distinct/independent sparse encoding for its actions
• int *SensorStartIndex/*[TotalSensorCount]*/: sparse mapping/encoding for the first
action in a joint list of actions (across all sensors) that corresponds to a sensor
• int *SensorTypes/*[TotalSensorCount]*/: type of each programmatically generated sen-
sor where the types are defined by the actions in the .POMDP file
• int *ActionToSensorMap/*[TotalSensorActions]*/: index of the (programmatically gen-
erated) sensor that does a (programmatically generated) action
• int *NumActionsPerSensorType/*[NumSensorTypes]*/: number of actions that each
sensor in the parsimonious list (from .POMDP file) can do
• char **actionNames/*[TotalSensorActions+NumDecisions][40]*/: names for
(programmatically defined) actions stored in a sparse format (i.e. all sensors confounded,
with a variable number of “actionName” strings per sensor)
• char **stateNames/*[NumTargetTypes][40]*/: ragged array of strings for state names
• char **observationNames: ragged array of strings for observation names
• int *stateList/*[NumTargetTypes]*/: deprecated, used when I had a terminal capture
state instead of ‘wait’ actions
• double *SensorTimeCost/*[TotalSensorActions]*/: the relative cost of each sensor mode
as specified in the .data file
• UINT32 TraceFlags: 32-bit unsigned integer that stores bitwise flags used for filtering
program output
The arrays SensorStartIndex[], SensorTypes[], ActionToSensorMap[] and actionNames[]
are all sparsely defined. The arrays NumActionsPerSensorType[] and
ParsimSensorStartIndex[] are defined w.r.t. the number of classes of sensor that the
actions in the .POMDP file are divided into. (The .POMDP file lists all the possible
actions in one long list; my .data file is responsible for associating each action with a
class of sensor, i.e. a sensor template. Once the number of actions per sensor template
is established, it is possible to specify how many sensors of each type are desired within
a simulation.)
The input files searchAndExploit_ver3.POMDP and searchAndExploit_ver3.data are
a pair, as are lwrBoundPaper_ver3.POMDP and lwrBoundPaper_ver3.data. These files
have a lot of parameters between them that need to be set correctly in order for the
sparse arrays above to function correctly. There are some comments at the top of my
*.data files that describe how those files are laid out, and the documentation Tony
provides for his program remains in effect for the .POMDP files. First and foremost,
great care must be taken to specify each parameter list w.r.t. the parsimonious action
list when parsimonious values are required, or w.r.t. the expanded/programmatically
generated list of actions when that is required. (If there are 3 sensors that each support
‘mode1’ and this mode has identical statistics for all 3 sensors, then ‘mode1’ would appear
once in a “parsimonious” list of actions but 3 times in the expanded/programmatically
generated list of actions, which provides 3 actions for the POMDP solver based on the
prototype ‘mode1’ specified in the .POMDP file.) In the .data files for my simulation
parameters, I give a comment line above each row of parameter values that documents
what that row is for and how many parameters should be there. My input files accept
%-comments at the beginning of lines and at the end as well. I specify resources and
initial lambda values w.r.t. the parsimonious action list and then replicate these values
as appropriate when sensors are instantiated from the templates defined in the .POMDP
file; currently, if there are multiple sensors of the same type in a simulation, they must
share the same values for initial sensor resources and initial lambdas (used to create an
initial basis for Column Generation).
As a brief overview, imagine the .POMDP file has the actions ‘wait_0’, ‘search_0’,
‘mode1_0’, ‘wait_1’, ‘mode1_1’, ‘mode2_1’, which indicates that there are 2 classes of
sensor, that the first sensor has a ‘search’ and a ‘mode1’ action, and that the second
sensor has a ‘mode1’ and a ‘mode2’ action. This is the parsimonious action list. (I have
also described it as a “Distinct” action list.) (The tags (suffixes) indicating which sensor
each action belongs to are programmatically overwritten when the actionNames[] array
is created for the expanded set of actions.) The .data file might have a line that
instantiates 2 of the first type of sensor and 1 of the latter: ‘3 0 0 1’ (this is the 3rd
non-comment line in the file). SensorTypes[] takes on the values of the last 3 parameters
on this line; the indices specify that 3 sensors of types 0, 0 and 1 will be used in the
simulator. (All indices in the program are always 0-based, except for certain lp_solve-5.5
function calls.) This specification would cause a list of actions to be defined (and the
value of gNumActions to be modified) such that the (non-parsimonious) actions are:
‘wait_0’, ‘search_0’, ‘mode1_0’, ‘wait_1’, ‘search_1’, ‘mode1_1’, ‘wait_2’, ‘mode1_2’,
‘mode2_2’ (this list would correspond to the first 9 entries in the actionNames[] array;
the actionNames[] array stores the names of the decision (aka declaration or classification)
hyperplanes after this list of 9 action names). In this example, TotalSensorCount = 3,
TotalSensorActions = 9, DistinctSensorActions = 6, SensorStartIndex[] = [0 3 6], and
ActionToSensorMap[] = [0 0 0 1 1 1 2 2 2].
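The following small stand-alone sketch (illustrative only; the real program builds these
arrays during its own initialization) shows how the bookkeeping of the example above can
be derived from the per-template action counts and the instantiated sensor types:

#include <stdio.h>

int main(void)
{
    /* Two templates, each with 3 actions: {wait_0, search_0, mode1_0} and
       {wait_1, mode1_1, mode2_1}; three sensors of types 0, 0 and 1.        */
    int NumActionsPerSensorType[2] = { 3, 3 };
    int SensorTypes[3]             = { 0, 0, 1 };
    int TotalSensorCount           = 3;

    int SensorStartIndex[3];
    int ActionToSensorMap[9];
    int i, a, TotalSensorActions = 0;

    for (i = 0; i < TotalSensorCount; i++) {
        SensorStartIndex[i] = TotalSensorActions;        /* first action of sensor i */
        for (a = 0; a < NumActionsPerSensorType[SensorTypes[i]]; a++)
            ActionToSensorMap[TotalSensorActions++] = i; /* owning sensor of action  */
    }

    printf("TotalSensorActions = %d\n", TotalSensorActions);           /* 9 */
    printf("SensorStartIndex   = [%d %d %d]\n",
           SensorStartIndex[0], SensorStartIndex[1], SensorStartIndex[2]);
    printf("ActionToSensorMap  = [");
    for (a = 0; a < TotalSensorActions; a++)
        printf("%d%s", ActionToSensorMap[a], a + 1 < TotalSensorActions ? " " : "]\n");
    return 0;
}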
The simulator uses 3 separate seeds for random number generation, each used for a
different purpose. The use of multiple seeds allows consistency to be maintained across
simulation runs. The seed “gCellSeed” controls the states (the types) of each location
(cell) in a simulation run. The seed “gMeasurementSeed” controls how the trajectory of
random observations evolves. The seed “gStrategySeed” is mainly used for the methods
of RH control that employ randomization (i.e. “randomly choose a pure strategy on
a per-location basis with a distribution given by the mixture weights”). By resetting
the value of gCellSeed after every series of 100 Monte Carlo simulation runs (but after
changing the resource levels, MD-to-FA ratios, horizon, etc.), it was possible to ensure
that each batch of 100 simulations used the same trajectory of cell (location) states, but
that these states evolved randomly from one simulation run to the next.
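The toy program below (an illustration of the idea only; the actual generator and seed
handling in the simulator may differ) shows why separate seeded streams are useful:
resetting the cell stream at the start of each batch reproduces the same cell states, while
the measurement stream keeps evolving independently:

#include <stdio.h>

/* A tiny xorshift32 generator, so each stream carries its own independent state. */
static unsigned int nextRand(unsigned int *state)
{
    unsigned int x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

int main(void)
{
    const unsigned int gCellSeed = 12345u;
    unsigned int measurementState = 67890u;   /* never reset across batches */
    int batch, i;

    for (batch = 0; batch < 2; batch++) {
        unsigned int cellState = gCellSeed;   /* reset at the start of each batch */
        printf("batch %d cell states:", batch);
        for (i = 0; i < 3; i++)
            printf(" %u", nextRand(&cellState) % 4);      /* identical per batch */
        printf("   first measurement draw: %u\n",
               nextRand(&measurementState) % 100);        /* differs per batch   */
    }
    return 0;
}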
Two enumerated types are worthy of mention. The first is used for controlling the
display of tracing information (i.e. logging of debug information to stdout or a file) in a
programmatically controllable fashion (rather than having all program output be fixed).
I created my own version of a printf() function, called myprintf(), and added another
parameter (a bitwise variable of boolean flags) that is passed in to each myprintf()
function call. A global variable called “TraceFlags” implements the other half of the
lock+key mechanism. When the global variable TraceFlags has matching bits with one
of the descriptive codes given to the myprintf() function, output is generated in the
log-file, see Section B.3. Therefore these trace-codes act as filters and control how much
if any debug information is stored. Unless a very specific aspect of the program is being
tested, the lower-level trace-codes generate output so quickly and so verbosely that the
program cannot run usefully while they are active; they are only of use for testing a
single function call or a chain of low-level function calls. I attempted to do this in such
a fashion as to allow certain types of context-sensitive debug information to be displayed
to the console (or to a file) to aid in the debugging process. I generally found this
mechanism useful, although sometimes it is hard to partition one aspect of the program’s
functionality into just one type of tracing operation. To handle the need for a many-to-one
mapping, I have myprintf()-type statements that output traced data if any one of a
number of these trace-codes is active, using a bitwise OR. The values of enum Tracecode
are:
• AllTrace=0: exception case, rather than trace nothing, trace everything
• SolverTrace=(1<<0): deprecated
• LikelihoodTrace=(1<<1): calculation of likelihoods used for branch probabilities while
walking decision-trees
• InitHyperplaneTrace=(1<<2): used to determine which hyperplane/policy-graph node
a cell/location should use
• PolicyGraphTrace=(1<<3): traces calculations while walking decision-trees
• CostTrace=(1<<4): traces cost information (classification + measurement) for individ-
ual Subproblems
• TestTrace=(1<<5): a dummy trace context to be used for whatever debugging a situa-
tion requires
• IntermedLPTrace=(1<<6): prints the Column Generation LP after it is created and
after each column is added
• ComputeResUsageTrace=(1<<7): traces cost information (classification + measure-
ment) across all Subproblems
• ColGenTrace=(1<<8): prints information on each column such as Lagrange multipliers
and objective function values as they are added
• StrategyTrace=(1<<9): used to display the strategy tree (all pure strategies) after
Column Generation and also expected cost information and initial nodes in the strategy
tree for each Subproblem
• ColGenOutputTrace=(1<<10): the main trace flag for Column Generation besides
StrategyTrace, displays numerical information about lambdas, resources, expected costs
• GridStateTrace=(1<<11): used in the simulator to show cell/location states as they are
generated, useful for debugging purposes with a small grid
• NewTaskTrace=(1<<12): trace information concerning a sensing task that is to be
added on the queue of tasks
152
• FinishTaskTrace=(1<<13): trace information concerning what happens after a task is
completed (and potentially a follow-up task is created)
• TaskListTrace=(1<<14): trace information concerning sensing tasks in general
• VehicleUpdatesTrace=(1<<15): trace high-level information concerning vehicle (i.e.
sensor) resources and outstanding tasks, queue sizes
• SimUpdatesTrace=(1<<16): trace information about the task that each vehicle does
during each update cycle
• SimResultsTrace=(1<<17): trace the final results of a simulation
• CellErrorTrace=(1<<18): trace information concerning the ML-classification of each
cell at the end of a simulation
• OutputTrace=(1<<19): the main trace flag for all simulator output
• PrintPGPointersTrace=(1<<20): was used for debugging a policy-graph (PG structure)
memory leak
• NoTrace=(1<<21): a label for the extent of the trace code flags, provides a NULL or
out-of-range value for this enumerated type
The trace-codes can be modified dynamically throughout the course of the program,
i.e. after detecting an error condition, additional trace-codes can be turned on and a
conditional breakpoint can be set to start going through the nitty-gritty details of the
program’s state from that point on.
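A minimal sketch of this lock-and-key filtering is shown below (the real myprintf() and
trace-code set are richer than this, and the handling of AllTrace here reflects one plausible
reading of the “trace everything” exception):

#include <stdarg.h>
#include <stdio.h>

typedef unsigned int UINT32;

enum Tracecode { AllTrace = 0, LikelihoodTrace = (1 << 1), CostTrace = (1 << 4) };

UINT32 TraceFlags = CostTrace;          /* the "lock": which trace-codes are active */

void myprintf(UINT32 code, const char *fmt, ...)
{
    va_list args;
    if (TraceFlags != AllTrace && (code & TraceFlags) == 0)
        return;                          /* key does not match the lock: filtered out */
    va_start(args, fmt);
    vprintf(fmt, args);
    va_end(args);
}

int main(void)
{
    myprintf(CostTrace,       "classification cost = %g\n", 1.25);   /* printed  */
    myprintf(LikelihoodTrace, "branch likelihood   = %g\n", 0.01);   /* filtered */
    return 0;
}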
Another enumerated type is used to control the method with which the simulator
uses mixed strategies for RH Control. I call this the “simulation mode” and use the enum
“Simulator_Mode” to specify its value:
• eChooseByMixtureWeight: choose the pure strategy with largest mixture weight
• eChooseByProbability: choose a pure strategy randomly to be used for all locations
based on a distribution governed by the mixture weights
• eChooseByProbabilityPerCell: choose as in ‘eChooseByProbability’, but on a location-
by-location basis
• eChooseByClassifyCost: choose whichever strategy has the best performance
• eChooseByMeasureCost: choose whichever strategy uses the least resources
• eNumSimulatorModes: a label for the extent of the simulation mode enums, provides a
NULL or out-of-range value for this enumerated type
I needed access to the hyperplanes from the solver, not as stored to a file at the end
of the program (after using the “save-all” pomdp-solve-5.3 command-line switch, which
had been the status quo), but in memory during the life-cycle of the program, so I had
to dig into the solver code. These hyperplanes are stored in next_alpha_list and
prev_alpha_list in the solvePomdp() function that Tony wrote. I added a level of
indirection that allowed me to use the pointer mechanism to output these lists from
solvePomdp(). I only ever bothered to get the hyperplanes from the last stage and the
penultimate stage, however. I needed the latter because the former points into the latter;
therefore I had to wait to destroy the penultimate-stage hyperplanes until I was done
tracing the policy-graphs to back out the measurement versus classification costs.
Policy-graphs are stored in a structure that holds all the data for one stage; see some of
Tony’s debug/output routines in pg.c for more information. In order to represent
strategies, I created a 3D dynamically allocated array: “PG ***
strategyTree/*[TotalSensorCount+1][NumVisibilityGroups][horizon]*/”. An individual
decision-tree could be represented as “PG * policyGraph/*[horizon]*/”; however, when
working with visibility groups, different subsets of sensors will in general have distinct
POMDP solutions with different decision-trees, hence the need for the second dimension
in this structure. The dimension of size [TotalSensorCount+1] is needed to represent one
whole mixed strategy (as a mixture of TotalSensorCount+1 pure strategies). At points in
the code I referred to one of the component pure strategies as a “PG **
policy_graph_group/*[NumVisibilityGroups][horizon]*/”. Hyperplanes are stored as a
linked-list of “AlphaList” nodes, and the ids of the nodes correspond to the ids of the
hyperplanes. The hyperplane coefficients themselves are stored in the member “alpha[]”
of AlphaList. One other important point is that, for whatever reason, Tony left the root
node in a list of hyperplanes as a dummy/header node (storing summary statistics, not
hyperplanes), and all the actual data follows after that in the linked list. He also has 2
separate ways of accessing immediate rewards (the pomdp_Q sparse matrix and the
gImmRewardList linked-list). In addition, he uses the global variable gCurAlphaVector
as an array (basically a vector) that points into his linked-lists of hyperplanes. So he
maintains hyperplanes both as a linked-list of nodes and also indexes them with a
vector/array as the situation warrants.
After the call to solvePomdp() and after tracing the decision-trees, I store per-task
information on classification costs and measurement costs in the global array TaskList[].
This methodology with this global array variable is suboptimal because it prevents mul-
tiple simulators from being able to run in parallel, and I had the intention of modifying
this code to make it more object-oriented; this is work in progress, but at least it works
as is.
The routine that runs the Column Generation algorithm in the simulator is:
double ColumnGen(
    const SensorPayload *sensorPayloadList/*[NumVisibilityGroups]*/,
    PomdpSolveParams param,
    PG **policy_graph_group/*[NumVisibilityGroups][horizon]*/,
    double **lambdaOfStrategy/*[MAX_COLUMNS+1][TotalSensorCount]*/,
    const double *R/*[TotalSensorCount]*/,
    double *initLambda/*[TotalSensorCount]*/,
    PG ***strategyTree/*[TotalSensorCount+1][NumVisibilityGroups][horizon]*/,
    int *strategyToColumnMap/*[TotalSensorCount+1]*/,
    int *pColumnsInSolution,
    REAL *pSolution/*[TotalSensorCount+1]*/,
    double *J_classify_perStrategy/*[TotalSensorCount+1]*/,
    double **J_measure_perStrategy/*[TotalSensorCount+1][TotalSensorCount]*/)
The return value is the optimal total (measurement + classification) cost of the LP at
the end of Column Generation. Occasionally, it is possible that a degenerate mixed
strategy is created (fewer pure strategies in the LP’s basis than anticipated), in which
case I still deem the result the “optimal” cost. However, when lp_solve-5.5 solves the LP
for the Column Generation Master Problem, if it does not return a success flag (indicating
that the optimal solution was found or that the LP was solved in the presolve step), then
the program asserts false and terminates. The arguments to ColumnGen() have the
following purposes:
• const SensorPayload *sensorPayloadList/*[NumVisibilityGroups]*/: describes the POMDP
problem parameters for each subset of sensors
• PomdpSolveParams param: Tony’s main structure that I pass on to his solvePomdp()
routine
• PG **policy_graph_group/*[NumVisibilityGroups][horizon]*/: temporary storage that I
use repeatedly
• double **lambdaOfStrategy/*[MAX_COLUMNS+1][TotalSensorCount]*/: stores the
trajectory of lambda values
• const double *R/*[TotalSensorCount]*/: the available resources for each sensor
• double *initLambda/*[TotalSensorCount]*/: the initial lambda values used to start Col-
umn Generation
• PG ***strategyTree/*[TotalSensorCount+1][NumVisibilityGroups][horizon]*/: pure strate-
gies output from Column Generation
• int *strategyToColumnMap/*[TotalSensorCount+1]*/: mapping that describes the ac-
tive columns/the columns with support at the end of the Column Generation process
• int *pColumnsInSolution: value returned by pointer for the number of columns (out of
a total of (MAX_COLUMNS+1)) that were generated in Column Generation
• REAL *pSolution/*[TotalSensorCount+1]*/: the mixture weights of each strategy with
support
• double *J_classify_perStrategy/*[TotalSensorCount+1]*/: the classification cost (across
all N locations) for each strategy
• double **J_measure_perStrategy/*[TotalSensorCount+1][TotalSensorCount]*/: the
measurement (sensor resource) cost (across all N locations) for each strategy
For simplicity, in the Column Generation process storage was allocated for up to
(MAX_COLUMNS+1) columns, and then only some of that memory was actually used,
as reported by the value of (*pColumnsInSolution). The array strategyToColumnMap[]
maps the pure strategies, indexed over [0, ..., TotalSensorCount], back into the range
[0, ..., MAX_COLUMNS], and specifies the (TotalSensorCount+1) elements out of
(MAX_COLUMNS+1) that are part of the solution. For instance,
lambdaOfStrategy[strategyToColumnMap[i]][j] would give the j-th Lagrange multiplier of
the i-th strategy.
In some situations it is possible for the ColumnGen() function to fail to find a
non-trivial Column Generation solution (by failing to establish an initial basis of
linearly-independent columns for the LP), or to otherwise give back results where, e.g.,
just one pure strategy is used when there are 2 sensors (so one would expect 3 pure
strategies in the solution). If fewer than TotalSensorCount+1 pure strategies have
support, then one or more elements of strategyToColumnMap[] are set to -1. These
circumstances typically arise at the end of a simulation when few resources remain and
either no modes are feasible for a sensor given its resource constraints, or else the cost
versus benefit of using a sensor mode weighs against the use of any resources in a
particular situation. This can happen, for instance, if all of the cell information states
(the probability vectors for each of the locations) are already very lopsided (low entropy)
and there is not much uncertainty left in the states of any of the locations.
When some sensors effectively have no more resources, but other sensors are still
useful, it would be ideal to eliminate the resource-less sensors from consideration in the
ColumnGen() function and to reduce the size of the POMDPs that are solved in the
solvePomdp() algorithm. Currently this is work in progress. The untoward result of not
doing this pruning operation is that many more columns can be generated while
Column Generation searches for precisely the right value of the Lagrange multiplier that
will satisfy a nearly infeasible constraint. While this does not crash the program, it does
waste some time. However, if presolving is used with the lp_solve-5.5 solver and
constraints are eliminated from the Column Generation LP, it is important to access the
solution (the Lagrange multipliers in the solution) in the right way. For example, the
lp_solve-5.5 functions get_Nrows() and get_Ncolumns() return the number of rows and
columns of the presolved model (after rows or columns are eliminated). The functions
get_Norig_rows() and get_Norig_columns() return the number of rows and columns in
the LP before presolving. In order to obtain solution variables relative to the original
enumeration of variables and constraints, the function get_var_primalresult() must be
used. Currently my code only supports a static indexing of rows and columns and does
not perform presolving operations that eliminate rows or columns. See the function
CreateLP() in column_gen.c for the relevant section of code. (However, when I modified
Tony’s pomdp-solve-5.3 code to make use of lp_solve-5.5, I did enable presolving w.r.t.
solving the LPs that prune hyperplanes. See the function LP_loadLpSolveLP() in his file
lp-interface.c for more details.) Eliminating from consideration sensors that are not
useful and that slow down Column Generation is one of the two or three most important
things that can be done to improve this program.
B.6 Simulator Design
Fig. B·1 shows the column_gen program’s startup process. The function
parseCmdLineAndCfgFile() is from pomdp-solve-5.3 and is responsible for generating the
PomdpSolveParams structure that contains all of the parameters for the POMDP solver
except for the matrices pomdp_P[], pomdp_R[] and pomdp_Q. After a CSimulator
object is created, I call initPomdpSolve() and the .POMDP file parameters are read in
from file. My simulation parameters are parsed in the function ReadSimulationData(),
and the matrices pomdp_P[], pomdp_R[] and pomdp_Q are created at this time (from a
set of intermediary/temporary dense matrices that were created during the execution
of initPomdpSolve()), and the linked-list gImmRewardList is created. At this time I
create a sensorPayload structure that represents a custom set of solution variables for
every subset of sensor visibilities that is of interest. The visibility groups of interest
are determined after reading in the groups of prior probabilities from the .data file and
determining how many distinct classes of visibility these priors are arranged into. As
discussed in Ch. 3, prior probabilities for targets and sensor-target visibilities are specified
in the .data file (the last lines in the file), for example as:
• 10 01 0.10 0.20 0.60 0.10 % 10 targets w/ prior π1 can be seen by sensor 0
• 90 11 0.02 0.06 0.12 0.80 % 90 targets w/ prior π2 can jointly be seen by sensors 0+1
In this example π1 = [0.10 0.20 0.60 0.10]^T and π2 = [0.02 0.06 0.12 0.80]^T, and the
position of the 0th sensor is on the right-hand side of the bitmask. The total number of
targets that these target groups add up to needs to be consistent with the parameters
given in the first line of the .data file (NumTargets = NumCellsX * NumCellsY). Once
again, until there are motion planning and locality constraints, the physical dimensions
of the grid (the layout of the locations) have no meaning.
The following is a pseudo-code for the operation of ColumnGen():
• set all Lagrange multipliers to ∞
• call ComputeResourceUsage() to compute classification cost for do-nothing strategy
• for i=1 to TotalSensorCount (i=0 is the do-nothing strategy):
– set Lagrange multipliers to initialize ith column
– call ComputeResourceUsage() to compute costs for strategy i
– add pure strategy from solvePomdp() to strategy tree
– store objective function coefficients and resource usage data in variables for LP
tableau
• if initialization was not successful, break out of function
• call CreateLP() using variables stored for LP tableau
• while Column Generation not converged:
– call SolveLP()
– store Lagrange multipliers from the solution of the LP
– call ComputeResourceUsage() with current Lagrange multiplier vector
– add pure strategy from solvePomdp() to strategy tree
– store objective function coefficients and resource usage data in variables for LP
tableau
Figure B·1: Sequence diagram of the startup process in column_gen.
– create and append new column with variables for LP tableau
• find the strategies in the final LP tableau with support
• call ComputeResourceUsage() for each final pure strategy with support and (this time)
record per-task solution data
• output solution data to console or to file as determined by the trace flag settings
where the function ComputeResourceUsage() has the following form:
• for i = 0 to TotalSensorCount-1:
– for v = 0 to NumVisibilityGroups-1:
∗ if vth sensorPayload contains ith sensor then update sensor payload’s Q and
immRewardList structures for sensor price according to current lambda vector,
convert to reward formulation
• update master list (pomdp_Q and gImmRewardList) for the new sensor prices, convert to reward formulation
• for j = 0 to NumVisibilityGroups-1:
– get pointer to jth sensorPayload structure
– call initPomdpSolveSolution(), the per-solver-call code I specialized out of Tony's initPomdpSolve() function
– call solvePomdp() using POMDP model in jth sensorPayload structure
– call convertActions() to convert the results of solvePomdp() which are relative to
the action list stored in jth sensorPayload structure, to the global, joint-action list
– trace hyperplane info or strategy (policy-graph) (or not) according to trace flag
settings
– initialize the measurement-cost vector J_measure[] to 0.0
– initialize the scalar variable J_classify to 0.0
– for i = 0 to NumTargets-1:
∗ call computeExpectedCosts()
· call findInitialHyperplane() to find the best hyperplane (action) for Subproblem i
· call walkPolicyGraph() to break apart the classification and measurement costs for Subproblem i
· accumulate the measurement cost of Subproblem i in the measurement-cost vector J_measure[]
∗ accumulate the classification cost of Subproblem i in the scalar variable J_classify
∗ store per-Subproblem solution info in the TaskList[] structure, convert from the reward formulation
– call cleanUpPomdpSolveSolution() to do per-solver-call cleanup work
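The sketch below renders the master loop of the ColumnGen() pseudo-code above in compact C++. The Column and MasterSolution types and the stubbed SolveLP()/ComputeResourceUsage() signatures are simplified stand-ins for the real LP tableau handling and POMDP subproblem solves in column_gen, so this is schematic rather than an excerpt.

// Schematic of the Column Generation master loop described above. The types
// and the SolveLP()/ComputeResourceUsage() stubs are placeholders for the real
// LP tableau handling and POMDP subproblem solves in column_gen.
#include <cstddef>
#include <vector>

struct Column {
    double classifyCost;               // objective coefficient (J_classify)
    std::vector<double> resourceUse;   // expected use per sensor (J_measure)
};

struct MasterSolution {
    std::vector<double> lambda;        // duals of the resource constraints
    std::vector<double> mixture;       // weights q_c on the columns
    double cost;                       // master LP objective value
};

// Placeholder stubs standing in for the real routines.
MasterSolution SolveLP(const std::vector<Column>& cols,
                       const std::vector<double>& budget) {
    MasterSolution s;
    s.lambda.assign(budget.size(), 0.0);
    s.cost = cols.empty() ? 0.0 : cols.front().classifyCost;
    return s;
}
Column ComputeResourceUsage(const std::vector<double>& lambda) {
    return Column{0.0, std::vector<double>(lambda.size(), 0.0)};
}

std::vector<Column> ColumnGenSketch(const std::vector<double>& budget,
                                    double tol = 1e-6) {
    std::vector<Column> cols;
    // Initialization: one column per sensor plus the do-nothing strategy,
    // mirroring the for-loop over i = 0..TotalSensorCount above.
    for (std::size_t i = 0; i <= budget.size(); ++i) {
        std::vector<double> lambda(budget.size(), 1e9);   // "infinite" prices
        if (i > 0) lambda[i - 1] = 0.0;                   // free the i-th sensor
        cols.push_back(ComputeResourceUsage(lambda));
    }
    double prev = 1e300;
    for (;;) {                                     // while not converged
        MasterSolution sol = SolveLP(cols, budget);        // master LP + duals
        Column newCol = ComputeResourceUsage(sol.lambda);  // pricing subproblems
        if (prev - sol.cost < tol) break;          // no further improvement
        prev = sol.cost;
        cols.push_back(newCol);                    // append the new column
    }
    return cols;
}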
Fig. B·2 shows how and when the C++ simulator makes use of the ColumnGen() al-
gorithm for computing sensor plans. The general algorithm establishes a count-down
timer that re-plans when either a) a number of tasks has been completed that equals
some multiple of the number of targets (Subproblems), or b) there are no more
outstanding tasks to do. Fig. B·3 gives an overview of the C++ simulator's operation
as it concerns RH Control and the execution of sensing tasks. The algorithm is
fairly straightforward and has relatively low computational complexity (and program-
ming complexity). The one issue worth discussing is how the simulator maintains
an expected resource budget for each sensor. The code keeps track of the sensor re-
sources available as well as the expected resource cost of all outstanding (scheduled)
tasks. The simulator does not try to schedule any task beyond the point where: sensor
resources available − (expected resource cost of all scheduled tasks + cost of the new task)
< 0. Instead, tasks are stored on two separate prioritized lists (queues): a list of tasks
that are definitely scheduled to be executed, "m_taskList", and a list of tasks that are
executed only if resources are left over after executing all the tasks from m_taskList.
The latter list of potentially executed tasks is called "m_delayedTaskList". (If resource
utilization turns out to be less than expected, extra tasks from m_delayedTaskList are
scheduled.) Tasks are added onto these two lists according to an entropy-gain value that
is used to sort/prioritize them. A new task with high priority can pop lower-value
task(s) off of m_taskList and push them onto m_delayedTaskList. (However, currently
I do not believe that one high-value task can pop off two lower-value ones.) Tasks that
cannot be scheduled within the expected resource constraints are simply ignored, as
sketched below.
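A compressed sketch of this two-list bookkeeping follows. The TaskSketch/VehicleSketch types and their fields are simplified stand-ins for the actual CTask and CVehicle members (m_taskList, m_delayedTaskList, m_taskedResources), so the sketch is illustrative rather than the real addTask() implementation.

// Illustrative sketch of the two-queue scheduling rule described above. The
// field and type names are simplified stand-ins for the CVehicle / CTask
// members, not the real code.
#include <cstddef>
#include <list>
#include <vector>

struct TaskSketch {
    double entropyGain;                 // priority used to order tasks
    std::vector<double> expectedCost;   // expected resource use per sensor
};

struct VehicleSketch {
    std::vector<double> available;          // sensor resources on hand
    std::vector<double> tasked;             // expected cost of scheduled tasks
    std::list<TaskSketch> taskList;         // definitely executed
    std::list<TaskSketch> delayedTaskList;  // executed only if resources remain

    bool fits(const TaskSketch& t) const {
        for (std::size_t s = 0; s < available.size(); ++s)
            if (tasked[s] + t.expectedCost[s] > available[s]) return false;
        return true;
    }
    void schedule(const TaskSketch& t) {
        for (std::size_t s = 0; s < tasked.size(); ++s)
            tasked[s] += t.expectedCost[s];
        taskList.push_back(t);
        taskList.sort([](const TaskSketch& a, const TaskSketch& b) {
            return a.entropyGain > b.entropyGain; });    // highest gain first
    }
    // Add a task within the expected resource budget; demote at most one
    // lower-valued scheduled task (as the current code does) before giving up.
    bool addTask(const TaskSketch& t) {
        if (fits(t)) { schedule(t); return true; }
        if (!taskList.empty() && taskList.back().entropyGain < t.entropyGain) {
            TaskSketch victim = taskList.back();     // least valuable scheduled task
            taskList.pop_back();
            for (std::size_t s = 0; s < tasked.size(); ++s)
                tasked[s] -= victim.expectedCost[s];
            delayedTaskList.push_back(victim);
            if (fits(t)) { schedule(t); return true; }
        }
        delayedTaskList.push_back(t);                // keep only as a potential task
        return false;
    }
};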
Figure B·2: Sequence diagram of how a sensor plan is constructed in column_gen.
Figure B·3: Sequence diagram of the update cycle in column_gen.
The simulator keeps track of several statistics over the course of the batches of
simulations that are executed upon calling RunSimulationConfiguration() in main.cpp.
The average simulation cost, the variance of the simulation cost, the average number of
unfinished tasks per simulation, the average number of interesting objects per simulation
and the average number of unused resources at the end of each simulation are reported
via the pointer-return mechanism. The quantity for the average number of unfinished
tasks per simulation, "avgUnfinishedTasks", is defined as: avgUnfinishedTasks = size of
m_delayedTaskList at the end of the simulation / numSimulations. This is essentially
the average number of tasks that were planned for in the Column Generation code but
were never actually executed because of the discrete nature of task resource expenditures,
e.g. it is not possible to do 1/100th of a mode2 action.
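For reference, batch statistics of this kind can be accumulated with simple running sums, as in the generic sketch below; the BatchStats structure and its field names are illustrative assumptions and not the actual interface of RunSimulationConfiguration(), which returns its results through pointer arguments.

// Generic sketch of the batch statistics described above (mean and variance of
// the per-run cost plus averages of the other per-run counters). Illustrative only.
#include <cstddef>

struct BatchStats {
    std::size_t n = 0;
    double sumCost = 0.0, sumCostSq = 0.0;
    double sumUnfinishedTasks = 0.0, sumUnusedResources = 0.0;

    void addRun(double cost, std::size_t unfinishedTasks, double unusedResources) {
        ++n;
        sumCost += cost;
        sumCostSq += cost * cost;
        sumUnfinishedTasks += static_cast<double>(unfinishedTasks);
        sumUnusedResources += unusedResources;
    }
    double avgCost() const { return n ? sumCost / n : 0.0; }
    double varCost() const {                        // population variance
        if (n == 0) return 0.0;
        double m = avgCost();
        return sumCostSq / n - m * m;
    }
    double avgUnfinishedTasks() const { return n ? sumUnfinishedTasks / n : 0.0; }
    double avgUnusedResources() const { return n ? sumUnusedResources / n : 0.0; }
};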
Fig. B·4 - Fig. B·7 provide the interfaces for the five main classes used in the C++ sim-
ulator: CSimulator, CVehicle, CGrid, CCell and CTask. The CSimulator class contains
the heart of the simulation code. The CVehicle class is a container that represents a
sensor platform. The CGrid and CCell classes are fairly trivial.
To conclude this appendix, a brief word is in order concerning how tasks generate
follow-up tasks. Rather than have each CVehicle object (representing one or more sensors)
plan a whole decision-tree's worth of tasks in its task queue, the method
employed was to create just one task in the queue that can create follow-up tasks
as required (depending on the actions specified by the associated decision-tree). Each
task is associated with a pure strategy, and therefore all follow-up tasks are produced
according to the child nodes of the decision-tree (pure strategy) that generated the
original task. When tasks are scheduled, they are budgeted not just according to their
immediate (deterministic) sensing-resource costs, but according to the expected "down-
stream" resource costs of the task and all of its potential child/follow-up tasks. In order
to keep track of the downstream resource costs, some bookkeeping is done
on the nodes of the decision-tree (PG nodes) while walking the decision-trees for each
pure strategy. The variable J_measure_downstream[] keeps track of this information.
The objective was not just to keep track of the immediate deterministic resource costs
(which is trivial), or of all of the resource costs until the end of the horizon (which is
what the walkPolicyGraph() function does), but to keep track of expected resource
expenditures over intermediate horizons as well. This allows the simulator to follow
a plan that, in expectation, will execute e.g. two sensing tasks per location when the
Column Generation sensing plan was computed with a horizon of 4 sensing actions per
location.
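The bookkeeping amounts to a probability-weighted backup over the nodes of each decision-tree. The recursion below sketches the idea; the PGNodeSketch layout and the backupDownstreamCost() name are simplified assumptions rather than the actual PG node structure or walkPolicyGraph() code.

// Sketch of accumulating expected "downstream" resource costs on the nodes of a
// decision-tree (policy graph): a node's expected cost is its own immediate
// sensing cost plus the observation-probability-weighted cost of its children.
// Node layout and names are simplified assumptions, not the real PG structure.
#include <cstddef>
#include <vector>

struct PGNodeSketch {
    double immediateCost = 0.0;            // deterministic cost of this action
    double downstreamCost = 0.0;           // filled in by the backup below
    std::vector<double> obsProb;           // probability of each observation
    std::vector<PGNodeSketch*> child;      // next node per observation (may be null)
};

// Returns the expected resource cost of executing the subtree rooted at n and
// caches it on the node (analogous to J_measure_downstream[]).
double backupDownstreamCost(PGNodeSketch* n) {
    if (n == nullptr) return 0.0;
    double expected = n->immediateCost;
    for (std::size_t o = 0; o < n->child.size(); ++o)
        expected += n->obsProb[o] * backupDownstreamCost(n->child[o]);
    n->downstreamCost = expected;
    return expected;
}

Truncating the recursion at an intermediate depth gives the intermediate-horizon budgets described above.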
CSimulator
− CGrid m_grid;
− CVehicleList m_vehicleList;
− int m_updateCount;
− PomdpSolveParams m_param;
− ColumnGenParams m_simParams;
− int m_origHorizon;
− int m_resourceLevelIndex;
− SensorPayload * m_sensorPayloadList/*[NumVisibilityGroups]*/;
− double *m_lambda/*[TotalSensorCount]*/;
− double *m_R/*[TotalSensorCount]*/;
− PG ** m_policy_graph_group/*[NumVisibilityGroups][horizon]*/;
− PG *** m_strategyTree/*[TotalSensorCount+1][NumVisibilityGroups][horizon]*/;
− REAL * m_pSolution/*[TotalSensorCount+1]*/;
− double * m_J_classify_perStrategy/*[TotalSensorCount+1]*/;
− double ** m_J_measure_perStrategy/*[TotalSensorCount+1][TotalSensorCount]*/;
− int *m_strategyToColumnMap/*[TotalSensorCount+1]*/;
− double **m_lambdaOfStrategy/*[MAX_COLUMNS+1][TotalSensorCount]*/;
− int m_nColumnsInSolution;
− bool m_bInit;
− bool m_bPlanIsReady;
− int m_nActionsBeforeReplanInit;
− int m_nActionsBeforeReplan;
− int m_nCurrentDecisionDepth;
− int assignTasksToVehicles();
− int assignTasksToVehiclesMyopically();
− int chooseStrategyByMixtureWeight(/*const int cellIndex*/) const;
− int chooseStrategyByProbability(/*const int cellIndex*/) const;
− int chooseStrategyByClassifyCost(/*const int cellIndex*/) const;
− int chooseStrategyByMeasureCost(/*const int cellIndex*/) const;
− void debugTestStrategyChoosingFunctions(const int numTrials = 1) const;
− void getVehicleResourceInfo();
+ CSimulator();
+ virtual ~CSimulator();
+ bool update();
+ const CVehicleList::size_type addVehicle(const CVehicle& newVehicle);
+ void removeVehicle(const int index);
+ void setSimMode(const Simulator_Mode newMode);
+ void setActionsBeforeReplanInit(const int numActions);
+ void create(PomdpSolveParams p, const int resourceLevelIndex,
Simulator_Mode eSimMode = eNumSimulatorModes);
+ void reset(const int resourceLevelIndex = 0, int MDToFARatio = 1,
int newHorizon = −1, const Simulator_Mode eSimMode = eNumSimulatorModes,
const bool bJustCreated = false);
+ void destroy();
+ bool checkIfVehiclesHaveResources() const;
+ void computeTotalVehicleResources(double * leftOverResources) const;
+ unsigned int computeTotalUnfinishedTasks() const;
+ const double simulationCost(const bool bDisplayCellCosts = false) const;
Figure B·4: Interface for CSimulator class.
CVehicle
− int m_id;
− int m_nHorizon;
− bool m_bUseResourceConstraints;
− CTaskList m_taskList, m_followUpTaskList, m_delayedTaskList;
− double * m_Pr_y/*[gNumObservations]*/;
− double * m_sensorResources/*[TotalSensorCount]*/;
− double * m_initSensorResources/*[TotalSensorCount]*/;
− double * m_taskedResources/*[TotalSensorCount]*/;
− bool useSensor(CTask& currentTask, int& obs/*,
double& remainingResourceCost*/);
− const unsigned int processFollowUpTasks();
− const unsigned int processDelayedTasks();
+ CVehicle(const Vehicle_Type type, const CPoint loc,
const double * sensorResources/*[TotalSensorCount]*/, int horizon,
CGrid& grid, const bool bUseResourceConstraints);
+ virtual ~CVehicle();
+ void reset(const bool bUseResourceConstraints, const int newHorizon,
const double * sensorResources = NULL);
+ void resetSensorResources(
const double * sensorResources/*[TotalSensorCount]*/ = NULL);
+ void clearTaskList(); // also resets m_taskedResources
+ bool addTask(const CTask& task, const bool isDelayedTask);
+ void printTaskList(const bool bDisplayAll = false) const;
+ const CTaskList::size_type numTasks() const;
+ const CTaskList::size_type numFollowUpTasks() const;
+ const CTaskList::size_type numDelayedTasks() const;
+ const double sensorResources(const int sensorIndex) const;
+ const double initSensorResources(const int sensorIndex) const;
+ const double taskedResources(const int sensorIndex) const;
+ const double availableResources(const int sensorIndex) const;
+ const unsigned int update();
Figure B·5: Interface for CVehicle class.
CGrid
− int m_nWidth, m_nHeight;
− CCellList m_cells;
− bool m_bInit;
− void createCells(const Target *targetList);
+ CGrid();
+ virtual ~CGrid();
+ const int width() const;
+ const int height() const;
+ const int numCells() const;
+ const unsigned int numThreats() const;
+ const bool isInit();
+ void reinit();
+ void reset();
+ void destroyGrid();
+ double calcClassificationCost() const;
+ void init(PomdpSolveParams param,
const Target *targetList,
const int width, const int height);
CCell
− int m_nType, m_nCellIndex;
− AlphaList m_decisionHyperplanes;
− AlphaList m_bestHyperplane;
− double m_classifyCost;
− double m_value, m_entropy, m_entropyChange;
− int m_modeOfMaxChange;
− double * m_pi, * m_lastPi, *m_origPi;
− void setValue();
− void setEntropy();
+ CCell(const int cellIndex, AlphaList decisionHyperplanes,
double * pi, const int x, const int y,
const int type = NumTargetTypes);
+ CCell(const CCell& right);
+ const CCell& operator = (const CCell& right);
+ virtual ~CCell();
+ CMyopicMeasurement findActionOfMaxEntropyChange(
const int sensorIndex, const double resourcesAvailable);
+ const double * pi() const { return m_pi; }
+ void setPi(const double * pi, const bool bZeroLastPi = false);
+ void reinit(const int type = NumTargetTypes);
+ void resetPi();
+ double classifyCost() const;
+ const AlphaList bestHyperplane() const;
+ double currentValue() const;
+ double currentEntropy() const;
Figure B·6: Interface for CGrid and CCell classes.
CTask
− static int m_classInstanceCount;
− int m_id;
− int m_actionIndex, m_sensorIndex, m_modeIndex;
− PG * m_policyGraphs;
− int m_nNumActions, m_nCurrentDepth, m_nHorizon;
− int m_nInitNodeIndex;
− int m_strategyIndex;
− double * m_resourcesForTask/*[TotalSensorCount]*/;
− double m_value;
− double m_measureWeight;
− void setResourcesForTask(const bool reinit = false);
− void setTaskValue();
+ CTask(CCell& cell, const int sensorIndex, const int modeIndex);
+ CTask(CCell& cell, PG * policy_graphs, int currentDepth,
int numActions, int horizon, int initNodeIndex, int strategyIndex,
const double * resourcesForTask = NULL);
+ virtual ~CTask();
+ const int actionIndex() const;
+ const int sensorIndex() const;
+ const int modeIndex() const;
+ const int strategyIndex() const;
+ const int id() const;
+ const double value() const;
+ const int initNodeIndex() const;
+ const int numActions() const;
+ const int currentDepth() const;
+ const int horizon() const;
+ double measureWeight() const;
+ double resourcesForTask(const int sensorIndex);
+ double resourcesForTask() const;
+ const double * resourcesForTaskPtr() const;
+ void setMeasureWeight(const double weight);
+ int nextNodeIndex(int obs) const;
+ CTask createFollowUpTask(int obs,
const double * remainingResourceCost/*[TotalSensorCount]*/);
+ void computeRemainingResourceCost(
double * remainingResourceCost/*[TotalSensorCount]*/, int obs) const;
+ bool operator < (const CTask& right) const;
+ CCell& cell() const;
+ static const int classInstanceCount();
+ static void resetClassInstanceCount();
Figure B·7: Interface for CTask class.
References
Anderson, J. D. (2006). Methods and metrics for human control of multi-robot teams.
Master’s thesis, Brigham Young University.
Athans, M. (1972). On the determination of optimal costly measurement strategies for
linear stochastic systems. Automatica, 8(4):397–412.
Baillieul, J. and Baronov, D. (2010). Information acquisition in the exploration of
random fields. In Hu, X. and Ghosh, B., editors, Three Decades of Progress in
Control. Springer.
Baker, M. and Yanco, H. (2004). Autonomy mode suggestions for improving human-
robot interaction. In Proceedings of the 2004 IEEE International Conference on
Systems, Man and Cybernetics, volume 3, pages 2948–2953.
Baronov, D. and Baillieul, J. (2010). Topology guided search of potential fields.
preprint (2010).
Bashan, E., Raich, R., and Hero, A. (2007). Adaptive sampling: Efficient search
schemes under resource constraints. www.eecs.umich.edu/~bashan/cspl-385.pdf.
Bashan, E., Raich, R., and Hero, A. (2008). Optimal two-stage search for sparse targets
using convex criteria. IEEE Transactions on Signal Processing, 56(11):5389–5402.
Bellman, R. E. (1957). Dynamic programming. Princeton University Press.
Benkoski, S. J., Monticino, M. G., and Weisinger, J. R. (1991). A survey of the search
theory literature. Naval Research Logistics, 38(4):469–494.
Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control, volume 1–2.
Athena Scientific, 3 edition.
Bertsimas, D. and Tsitsiklis, J. (1997). Introduction to Linear Optimization. Athena
Scientific.
Bruemmer, D. J. and Walton, M. C. (2003). Collaborative tools for mixed teams of
humans and robots. In Proceedings of the Workshop on Multi-Robot Systems, pages
219–229.
Cassandra, A. R. (1999). Tony’s pomdp-solve page. http://www.cassandra.org/
pomdp/code/index.shtml.
Castañón, D. and Wohletz, J. (2009). Model predictive control for stochastic resource
allocation. IEEE Transactions on Automatic Control, 54(8):1739–1750.
Castañón, D. A. (1995). Optimal search strategies in dynamic hypothesis testing. IEEE
Transactions on Systems, Man and Cybernetics, 25(7):1130–1138.
Castañón, D. A. (1997). Approximate dynamic programming for sensor management.
In Proceedings of the 36th IEEE Conference on Decision and Control, pages 1202–
1207.
Castañón, D. A. (2005a). A lower bound on adaptive sensor management performance
for classification. preprint (2005).
Castañón, D. A. (2005b). Stochastic control bounds on sensor network performance. In
Proceedings of the 44th IEEE Conference on Decision and Control, pages 4939–4944.
Castañón, D. A. and Carin, L. (2008). Stochastic control theory for sensor management.
In Hero, A., Castañón, D., Cochran, D., and Kastella, K., editors, Foundations and
Applications of Sensor Management. Springer Verlag, New York, NY.
Castañón, D. A. and Wohletz, J. (2002). Model predictive control for dynamic unreliable
resource allocation. In Proceedings of the 41st IEEE Conference on Decision and
Control, volume 4, pages 3754–3759.
Chernoff, H. (1972). Sequential Analysis and Optimal Design. Society for Industrial
and Applied Mathematics.
Chong, E., Kreucher, C., and Hero, A. (2008a). Monte-carlo-based partially observable
Markov decision process approximations for adaptive sensing. International Work-
shop on Discrete Event Systems, pages 173–180.
Chong, E., Kreucher, C., and Hero, A. (2008b). Partially observable Markov decision
process approximations for adaptive sensing. preprint (2008).
Crandall, J. and Cummings, M. (2008). A predictive model for human-unmanned vehi-
cle systems. Technical report, MIT Humans and Automation Laboratory, Cambridge,
MA.
Cummings, M. and Mitchell, P. (2008). Predicting controller capacity in supervisory
control of multiple UAVs. IEEE Transactions on Systems, Man and Cybernetics,
Part A: Systems and Humans, 38(2):451–460.
Cummings, M. L., Mitchell, P. M., and Sheridan, T. B. (2005). Human supervisory
control challenges in network centric operations. Technical report, In Human Systems
Information Analysis Center (HSIAC) (Ed.), State of the Art Report. Dayton, OH:
AFRL.
Cummings, M. L. and Morales, D. (2005). UAVs as tactical wingmen: Control methods
and pilots’ perceptions. Unmanned Systems, 23(1):25–27.
Dantzig, G. B. and Wolfe, P. (1961). The decomposition algorithm for linear programs.
Econometrica, 29(4):767–778.
Dudenhoeffer, D. D. (2001). Command and control architectures for autonomous micro-
robotic forces - FY-2000 project report. Technical report, Idaho National Laboratory.
Dudenhoeffer, D. D., Bruemmer, D. J., and Davis, M. L. (2001). Modeling and simula-
tion for exploring human-robot team interaction requirements. In Proceedings of the
33rd conference on Winter simulation, pages 730–739, Washington, DC, USA. IEEE
Computer Society.
Endsley, M. R. (1988). Design and evaluation for situation awareness enhancement.
In Proceedings of the Human Factors Society 32nd Annual Meeting, volume 1, pages
97–101. Human Factors Society.
Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press, New York.
Freedy, A., DeVisser, E., Weltman, G., and Coeyman, N. (2007). Measurement of
trust in human-robot collaboration. In International Symposium on Collaborative
Technologies and Systems 2007, pages 106–114.
Gerkey, B. P. and Matarić, M. J. (2004). A formal analysis and taxonomy of task
allocation in multi-robot systems. The International Journal of Robotics Research,
23(9):939–954.
Gilmore, P. C. and Gomory, R. E. (1961). A linear programming approach to the
cutting-stock problem. Operations Research, 9(6):849–859.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the
Royal Statistical Society. Series B, 41(2):148–177.
Goodrich, M. A., Olsen, Jr., D. R., Crandall, J. W., and Palmer, T. J. (2001). Experi-
ments in adjustable autonomy. In Proceedings of the International Joint Conference
on Artificial Intelligence (IJCAI) Workshop on Autonomy, Delegation and Control:
Interacting with Intelligent Agents, pages 1624–1629.
Grocholsky, B. (2002). Information-Theoretic Control of Multiple Sensor Platforms.
PhD thesis, University of Sydney.
Grocholsky, B., Makarenko, A., and Durrant-Whyte, H. (2003). Information-theoretic
coordinated control of multiple sensor platforms. In Proceedings of the IEEE Inter-
national Conference on Robotics and Automation, pages 1521–1526.
Hitchings, D. C. and Castañón, D. A. (2010). Receding horizon stochastic control algo-
rithms for sensor management. In Proceedings of the American Control Conference,
Baltimore, MD.
Jenkins, K. (2010). Adaptive Sensor Management for Feature-Based Classification.
PhD thesis, Boston University.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in
partially observable stochastic domains. Artificial Intelligence, 101:99–134.
Kastella, K. (1996). Discrimination gain for sensor management in multitarget detection
and tracking. In Computational Engineering in Systems Application (CESA) 1996:
Proceedings of the IEEE-Systems, Man, and Cybernetics (SMC) and International
Association for Mathematics and Computers in Simulation (IMACS) Multiconference,
volume 1, pages 167–172.
Kastella, K. (1997). Discrimination gain to optimize detection and classification. IEEE
Transactions on Systems, Man and Cybernetics, Part A, 27(1):112–116.
Kaupp, T. and Makarenko, A. (2008). Measuring human-robot team effectiveness to
determine an appropriate autonomy level. In Proceedings of the 2008 IEEE Interna-
tional Conference on Robotics and Automation (ICRA), pages 2146–2151.
Kiefer, J. C. (1959). Optimum experimental designs. Journal of the Royal Statistical
Society Series B, 21:272–319.
Koopman, B. (1980). Search and Screening: General Principles with Historical Appli-
cations. Pergamon, New York NY.
Koopman, B. O. (1946). Search and Screening. Operations Evaluation Group Report
No. 56. Technical report, Center for Naval Analyses, Alexandria, VA.
Kreucher, C. and Hero, A. (2006). Monte carlo methods for sensor management in
target tracking. In Proceedings of the IEEE Nonlinear Statistical Signal Processing
Workshop.
Kreucher, C., Kastella, K., and Hero, A. O., III (2005). Sensor management using an
active sensing approach. Signal Processing, 85(3):607–624.
Krishnamurthy, V. and Evans, J. (2001a). Optimal sensor scheduling for hidden Markov
model state estimation. International Journal of Control, 74(18):1737–1742.
Krishnamurthy, V. and Evans, R. (2001b). Hidden Markov model multiarm bandits:
a methodology for beam scheduling in multitarget tracking. IEEE Transactions on
Signal Processing, 49(12):2893–2908.
Leung, J. Y.-T. (2004). Handbook of scheduling: algorithms, models, and performance
analysis, volume 1. CRC Press. Chapman & Hall/CRC computer and information
science series.
Lindley, D. V. (1956). On a measure of the information provided by an experiment.
Annals of Mathematical Statistics, 27:986–1005.
Littman, M. L. (1994). The Witness algorithm: Solving partially observable Markov
decision processes. Technical report, Brown University, Department of Computer
Science, Providence, RI.
Lovejoy, W. S. (1991a). Computationally feasible bounds for partially observed Markov
decision processes. Operations Research, 39(1):162.
Lovejoy, W. S. (1991b). A survey of algorithmic methods for partially observed Markov
decision processes. Annals of Operations Research, 28(1-4):47–66.
McMahan, H. B. (2006). Robust Planning in Domains with Stochastic Outcomes, Ad-
versaries, and Partial Observability. PhD thesis, Carnegie Mellon University.
Monahan, G. E. (1982). A survey of partially observable Markov decision processes:
Theory, models, and algorithms. Management Science, 28(1):1–16.
Nehme, C. E. and Cummings, M. L. (2007). An analysis of heterogeneity in futuristic
unmanned vehicle systems. Technical report, MIT Dspace.
Patrascu, R.-E. (2004). Linear Approximations For Factored Markov Decision Pro-
cesses. PhD thesis, University of Waterloo.
Pineau, J., Gordon, G., and Thrun, S. (2003). Point-based value iteration: An anytime
algorithm for POMDPs. In Proceedings of the International Joint Conference on
Artificial Intelligence, pages 1025–1032.
Raghunathan, D. and Baillieul, J. (2010). Search decisions in a game of polynomial
root counting. preprint (2010).
Rangarajan, R., Raich, R., and Hero, A. (2007). Optimal sequential energy allocation
for inverse problems. IEEE Journal on Selected Topics in Signal Processing, 1(1):67–
78.
Schermerhorn, P. and Scheutz, M. (2009). Dynamic robot autonomy: Investigating
the effects of robot decision-making in a human-robot team task. In Proceedings of
the 2009 International Conference on Multimodal Interfaces, pages 63–70, New York,
NY, USA. ACM.
Schmaedeke, W. W. (1993). Information-based sensor management. In Kadar, I.
and Libby, V., editors, Proceedings of Signal Processing, Sensor Fusion, and Target
Recognition II, volume 1955, pages 156–164. SPIE.
Schmaedeke, W. W. and Kastella, K. D. (1994). Event-averaged maximum likelihood
estimation and information-based sensor management. In Kadar, I. and Libby, V.,
editors, Proceedings of Signal Processing, Sensor Fusion, and Target Recognition III,
volume 2232, pages 91–96. SPIE.
Schneider, M., Mealy, G., and Pait, F. (2004). Closing the loop in sensor fusion systems:
stochastic dynamic programming approaches. In Proceedings of the American Control
Conference, volume 5, pages 4752–4757.
Scholtz, J. C. (2002). Human-Robot Interactions: Creating synergistic cyber forces,
pages 177–184. Kluwer Academic Publishers.
Sellers, D. (1996). A survey of approaches to the job shop scheduling problem. In
Proceedings of the Twenty-Eighth Southeastern Symposium on System Theory, pages
396–400.
Smallwood, R. D. and Sondik, E. J. (1973). The optimal control of partially observable
Markov processes over a finite horizon. Operations Research, 21(5):1071–1088.
Steinfeld, A., Fong, T., Kaber, D. B., Lewis, M., Scholtz, J., Schultz, A. C., and
Goodrich, M. A. (2006). Common metrics for human-robot interaction. In Hu-
man Robot Interaction, pages 33–40.
Stone, L. D. (1975). Theory of Optimal Search. Academic Press.
Stone, L. D. (1977). Search theory: A mathematical theory for finding lost objects.
Mathematics Magazine, 50(5):248–256.
Tebboth, J. R. (2001). A computational study of Dantzig-Wolfe decomposition. PhD
thesis, University of Buckingham.
United States Department of Defense (2009). Unmanned systems integrated roadmap
2009-2034. Technical report, Office of the Secretary of Defense, U.S.A. www.acq.osd.mil/uas/docs/UMSIntegratedRoadmap2009.pdf.
Wald, A. (1943). On the efficient design of statistical investigations. The Annals of
Mathematical Statistics, 14:134–140.
Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical
Statistics, 16(2):117–186.
Washburn, R., Schneider, M., and Fox, J. (2002). Stochastic dynamic programming
based approaches to sensor resource management. In Proceedings of the 5th Interna-
tional Conference Information Fusion, volume 1, pages 608–615.
Williams, J., Fisher, J., and Willsky, A. (2005). An approximate dynamic program-
ming approach for communication constrained inference. In Proceedings of the IEEE
Workshop on Statistical Signal Processing, pages 1202–1207.
Williams, J. L. (2007). Information Theoretic Sensor Management. PhD thesis, Mas-
sachusetts Institute of Technology.
Wintenby, J. and Krishnamurthy, V. (2006). Hierarchical resource management in
adaptive airborne surveillance radars. IEEE Transactions on Aerospace and Elec-
tronic Systems, 42(2):401–420.
Wong, E.-M., Bourgault, F., and Furukawa, T. (2005). Multi-vehicle Bayesian search
for multiple lost targets. In Proceedings of the 2005 IEEE International Conference
on Robotics and Automation, pages 3169–3174.
Yost, K. A. and Washburn, A. R. (2000). The LP/POMDP marriage: Optimization
with imperfect information. Naval Research Logistics, 47(8):607–619.
dissertation

  • 1.
    ' & $ % ADAPTIVE MULTI-PLATFORM SEARCHAND EXPLOITATION DARIN CHESTER HITCHINGS Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy BOSTON UNIVERSITY
  • 3.
    BOSTON UNIVERSITY COLLEGE OFENGINEERING Dissertation ADAPTIVE MULTI-PLATFORM SEARCH AND EXPLOITATION by DARIN CHESTER HITCHINGS B.S., University of California, San Diego, 2000 M.S., Boston University, 2002 Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2010
  • 4.
    Approved by First Reader DavidA. Casta˜n´on, Ph.D. Professor of Electrical and Computer Engineering Second Reader John Baillieul, Ph.D. Professor of Mechanical Engineering Third Reader Christos G. Cassandras, Ph.D. Professor of Electrical and Computer Engineering Fourth Reader Prakash Ishwar, Ph.D. Assistant Professor of Electrical and Computer Engineering
  • 5.
    Men have becomethe tools of their tools. Henry David Thoreau
  • 6.
    Acknowledgments I have hadtons of support from professors, teachers, friends and family over the years since I undertook the goal of obtaining a Ph.D., quite naively, at age 11. First of all, I want to heartily thank my adviser, Prof. David Casta˜n´on. David is an extremely gifted applied mathematician and a phenomenally good adviser. I’ve also had many excellent courses with him. I am incredibly fortunate to have had the chance to work with David. I also want to thank my committee members Prof. John Baillieul, Prof. Christos Cassandras, Prof. Prakash Ishwar and Prof. Ajay Joshi for their involvement and feedback with my dissertation. I want to thank my father, Todd, for beginning me on the path of mathematics when I was very young. My dad has a very analytical mind, and he initiated me not only in word problems and logic, but also in programming. The resources he gave me in elementary school to work with electronic circuits and program computers have shaped my career. My beloved mother, Valarie, I thank for all her love and support over the years. Her appreciation of the fine arts, history and especially languages are precious to me. She’s been there every step of the way from teaching me how to read and write at age 5 to copy-editing my dissertation at present. I thank my older brother Sean for the the good times we have shared, the lessons on Calculus and the competition, which has made me stronger. Making a BASIC program to call me a turkey at Thanksgiving when I was 6–7 surely motivated my interest in computers! Sean’s knowledge of literature is the most encyclopedic of anyone’s I will ever know. I thank my aunt Karen and uncle Jon for the fine examples they have set with respect to their education, their careers, their awesome adventures and the gentle path they tread through life. Jay, my younger brother, has taught me much of what I know about people. He saved me from drowning when we moved to Florida; I owe him everything. He is greatly respected and adored by all who know him. iv
  • 7.
    Mr. H.T. Payne,Mr. Ted Brecke, Mrs. Marilyn Hoffacker, Mrs. Patricia Franks, Mr. Dale Russel, Dr. Oshri Karmon, Prof. Anthony Sebald and Mlle. Mireille Chazalviel have all been my mentors in life and helped set me on this path; my success is theirs. I’ve worked closely with my friends Karen Jenkins and Rohit Kumar and very much appreciate their feedback and advice over the years. Thanks to Ye Wang for getting me going with Linux (in addition to his friendship)! Thanks very much to Sonia Pujol for the help with my defense and LATEX. Thanks to George Atia for his friendship and advice. My thanks to Chris Karl and Shameek Gupta and all my friends inside and outside the ISS lab (too many to name) who have afforded me an awesome graduate experience. I love my friends and am very grateful for their camaraderie, support and discourse. Jessica, my girlfriend, I thank for the support, affection, adventures and for spoiling me shamelessly with her cuisine `a la fran¸caise. I would like to thank Prof. Janusz Konrad for the dissertation template and Prof. Clem Karl for the excellent class. My thanks to both of these professors, Mr. Daniel Kamalic and Mr. James Goebel for their dedication to IT issues in the ISS lab. I’ve very much appreciated my conversations with Prof. Selim ¨Unl¨u, Prof. Prakash Ishwar, Prof. Robert Kotiuga, Prof. Josh Semeter, Prof. Franco Cerrina and Mr. Jeff Murphy concerning how to improve graduate student life on campus in my capacity as president of the Student Association of Graduate Engineers (SAGE) for 2009–2010. Thanks to all the SAGE officers, especially Chris Garay and Ye Wang, in addition to Cheryl Kelly and Helaine Friedlander for what they gave of themselves to this university. Last of all, I thank the Air Force Office of Scientific Research and the Office of the Director, Defense Research & Engineering for providing support for this dissertation under grants FA9550-06-1-0324, FA9550-07-1-0361 and FA9550-07-1-0528. v
  • 8.
    ADAPTIVE MULTI-PLATFORM SEARCHAND EXPLOITATION (Order No. ) DARIN CHESTER HITCHINGS Boston University, College of Engineering, 2010 Major Professor: David A. Casta˜n´on, Ph.D. , Professor of Electrical and Computer Engineering ABSTRACT Recent improvements in the capabilities of autonomous vehicles have motivated their increased use in such applications as defense, homeland security, environmental moni- toring and surveillance. To enhance performance in these applications, new algorithms are required to control teams of robots autonomously and through limited interactions with human operators. In this dissertation we develop new algorithms for control of robots performing information-seeking missions in unknown environments. These mis- sions require robots to control their sensors in order to discover the presence of objects, keep track of the objects and learn what these objects are, given a fixed sensing budget. Initially, we investigate control of multiple sensors, with a finite set of sensing options and finite-valued measurements, to locate and classify objects given a limited budget. The control problem is formulated as a Partially Observed Markov Decision Problem (POMDP), but its exact solution requires excessive computation. Under the assumption that sensor error statistics are independent and time-invariant, we develop a class of algorithms using Lagrangian Relaxation techniques to obtain optimal mixed strategies using performance bounds developed in previous research. We investigate alternative vi
  • 9.
    Receding Horizon controllersto convert the mixed strategies to feasible adaptive-sensing strategies, and evaluate the relative performance of these controllers in simulation. The resulting controllers provide superior performance to alternative algorithms proposed in the literature, and obtain solutions to large-scale POMDP problems several orders of magnitude faster than optimal dynamic programming approaches with comparable performance quality. We extend our results for finite action, finite measurement sensor control to scenarios with moving objects. We use Hidden Markov Models (HMMs) for the evolution of objects, according to the dynamics of a birth-death process. We develop a new lower bound on the performance of adaptive controllers in these scenarios, develop algorithms for computing solutions to this lower bound, and use these algorithms as part of a Receding Horizon controller for sensor allocation in the presence of moving objects. We also consider an adaptive-search problem where sensing actions are continuous and the underlying measurement space is also continuous. We extend our previous hierarchical decomposition approach based on performance bounds to this problem, and develop novel implementations of Stochastic Dynamic Programming (SDP) techniques to solve this problem. Our algorithms are nearly two orders of magnitude faster than previously proposed approaches, and yield solutions of comparable quality. For supervisory control, we discuss how human operators can work with and augment robotic teams performing these tasks. Our focus is on how tasks are partitioned among teams of robots, and how a human operator can make intelligent decisions for task partitioning. We explore these questions through the design of a game that involves robot automata controlled by our algorithms and a human supervisor that partitions tasks based on different levels of support information. This game can be used with human subject experiments to explore the effect of information on quality of supervisory control. vii
  • 10.
    Contents 1 Introduction 1 1.1Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Dissertation Scope and Contributions . . . . . . . . . . . . . . . . . . . . 4 1.3 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Background 7 2.1 Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Search Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.2 Information Theory for Adaptive Sensor Management . . . . . . . 9 2.1.3 Multi-armed Bandit Problems . . . . . . . . . . . . . . . . . . . . 10 2.1.4 Stochastic Control . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.5 Human-Robot Interactions and Human Factors . . . . . . . . . . 12 2.1.6 Summary of Background Work . . . . . . . . . . . . . . . . . . . 13 2.2 Sensor Management Formulation and Previous Results . . . . . . . . . . 14 2.2.1 Stationary SM Problem Formulation . . . . . . . . . . . . . . . . 14 2.2.2 Addressing the Search versus Exploitation Trade-off . . . . . . . . 27 2.2.3 Tracing Decision-Trees . . . . . . . . . . . . . . . . . . . . . . . . 28 2.2.4 Violation of Stationarity Assumptions . . . . . . . . . . . . . . . 33 2.3 Column Generation And POMDP Subproblem Example . . . . . . . . . 34 3 Receding Horizon Control with Approximate, Mixed Strategies 39 3.1 Receding Horizon Control Algorithm . . . . . . . . . . . . . . . . . . . . 40 3.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 viii
  • 11.
    4 Adaptive SMwith State Dynamics 55 4.1 Time-varying States Per Location . . . . . . . . . . . . . . . . . . . . . . 55 4.2 Time-varying Visibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5 Adaptive Sensing with Continuous Action and Measurement Spaces 72 5.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.2 Relaxed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.3 Bayesian Objective Formulation . . . . . . . . . . . . . . . . . . . . . . . 84 5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6 Human-Robot Semi-Autonomous Systems 96 6.1 Optimizing Human-Robot Team Performance . . . . . . . . . . . . . . . 96 6.1.1 Differences Between Human and Machine World Models . . . . . 97 6.1.2 Human Decision-Making Response Time . . . . . . . . . . . . . . 97 6.1.3 Human and Machine Strengths and Weaknesses . . . . . . . . . . 98 6.1.4 Time-Varying Machine Autonomy . . . . . . . . . . . . . . . . . . 100 6.1.5 Machine Awareness of Human Inputs . . . . . . . . . . . . . . . . 100 6.2 Control Structures for HRI . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.2.1 Verification of Machine Decisions . . . . . . . . . . . . . . . . . . 103 6.3 Strategy Game Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7 Conclusion 111 7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . 113 A Background Theory 116 A.1 Partially Observable Markov Decision Processes . . . . . . . . . . . . . . 116 A.2 Point-Based Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 124 ix
  • 12.
    A.3 Dantzig-Wolfe Decompositionand Column Generation for LPs . . . . . . 125 B Documentation for column gen Simulator 131 B.1 Build Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 B.2 Running column gen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 B.3 Outputs of column gen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 B.4 Program Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 B.5 Global variables in column gen . . . . . . . . . . . . . . . . . . . . . . . 143 B.6 Simulator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 References 170 x
  • 13.
    List of Tables 2.1Example of expanded sensor model for an SEAD mission scenario where the states are {‘empty’, ‘car’, ‘truck’, ‘SAM’} and the observations are ys,m = {o1 = ‘see nothing’, o2 = ‘civilian vehicle’, o3 = ‘military vehicle’} ∀ s, m. This setup models a single sensor with modes {u1 = ‘search’, u2 = ‘mode1’, u3 = ‘mode2’} where mode2 by definition is a higher- quality mode than mode1. Using mode1, trucks can look like SAMs, but cars do not look like SAMs. . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.2 Column Generation example with 100 objects. The tableau is displayed in its final form after convergence. λc s describe the lambda trajectories up until convergence. R1 and R2 are resource constraints. γ1 is a ‘do-nothing’ strategy. Bold numbers represent useful solution data. . . . . . . . . . . . . . . . . . 36 3.1 Observation likelihoods for different sensor modes with the observation symbols o1, o2 and o3. Low-res = ‘mode1’ and High-res = ‘mode2’. . . . . . . . . . . 43 3.2 Decision costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3 Simulation results for 2 homogeneous, multi-modal sensors in a search and classify scenario. str1: select the most likely pure strategy for all locations; str2: randomize the choice of strategy per location according to mixture prob- abilities; str3: select the strategy that yields the least expected use of resources for all locations. See Fig. 3·2 - Fig. 3·4 for the graphical version of this table. 46 xi
  • 14.
    3.4 Bounds forthe simulations results in Table 3.3. When the horizon is short, the 3 MPC algorithms execute more observations per object than were used to compute the “bound”, and therefore, in this case, the bounds do not match the simulations; otherwise, the bounds are good. . . . . . . . . . . . . . . . . 46 3.5 Comparison of lower bounds for 2 homogeneous, bi-modal sensors (left 3 columns) versus 2 heterogeneous sensors in which S1 has only ‘mode1’ available but S2 supports both ‘mode1’ and ‘mode2’ (right 3 columns). There is 1 visibility- group with πi(0) = [0.7 0.2 0.1]T ∀ i ∈ [0..99]. For many of the cases studied there is a performance hit of 10–20%. . . . . . . . . . . . . . . . . . . . . . 49 3.6 Comparison of sensor overlap bounds with 2 homogeneous, bi-modal, sensors and 3 visibility-groups. Both configurations use the prior πi(0) = [0.7 0.2 0.1]T . Compare and contrast with the left half of Table 3.5, most of the time the two sensors have enough objects in view to be able to efficiently use their resources for both the 60% and 20% overlap configurations; only the bold numbers are different. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.7 Simulation results for 3 homogeneous sensors without using detection but with partial overlap as shown in Fig. 3·5. See Fig. 3·6 - Fig. 3·8 for the graphical version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.8 Bounds for the simulations results in Table 3.7. When the horizon is short, the 3 MPC algorithms execute more observations per object than were used to compute the bound, and therefore, in this case, the bounds do not match the simulations; otherwise, the bounds are good. . . . . . . . . . . . . . . . . . . 51 5.1 Performance comparison averaged over 100 Monte Carlo simulations. Re- laxation is the algorithm proposed in this chapter, while Exact is the algorithm of [Bashan et al., 2008] . . . . . . . . . . . . . . . . . . . . . . 93 xii
  • 15.
    List of Figures 2·1Illustrative example set of prior-probabilities πi(0) using a MATLAB “stem” plot for case where N = 9 and D = 2. Assuming objects are classified into 3 types, the Maximum-Likelihood estimate of locations xi with i ∈ {3, . . . , 6, 8, 9} is type 0 (empty). The ML estimate of xi for i ∈ {1, 2, 7} is type 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2·2 Hyperplanes from the Value Iteration algorithm that accompany Fig. 2·3. 23 2·3 Policy graphs for the optimal classification of a location with state {‘non- military’,‘military’}, two possible actions {‘Mode1’,‘Mode2’}, and two possible observations {‘y1’,‘y2’}. . . . . . . . . . . . . . . . . . . . . . . 23 2·4 This figure is a plot of expected cost (measurement+classification) versus MD for 3 different resource levels. The solid (blue) line gives the performance when the resources are pooled into one sensor and the dashed (red) line gives the performance when the resources are split across two sensors. . . . . . . . . . 25 2·5 Schematic showing how the master problem coordinates the activities of the POMDP subproblems using Column Generation and Lagrangian Relaxation. After the master problem generates enough columns to find the optimal values for the Lagrange multipliers, there is no longer any benefit to violating one of the resource constraints and the subproblems (with augmented costs) are decoupled in expectation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 xiii
  • 16.
    2·6 Illustration of“tracing” or “walking” a decision-tree for a POMDP sub- problem to calculate expected measurement and classification costs (the individual costs from the total). . . . . . . . . . . . . . . . . . . . . . . 32 2·7 Strategy 1 (mixture weight=0.726). πi(0) = [0.1 0.6 0.2 0.1]’ ∀ i ∈ [0, . . . , 9], πi(0) = [0.80 0.12 0.06 0.02]T ∀ i ∈ [10, . . . , 99]. The first 10 objects start with node 5, the remaining 90 start with node 43. The notation [i Ni0 Ni1 Ni2] indicates the next node/action from node i as a function of observing the 0th, 1st or 2nd observations respectively. . . . . . . . . . . . . . . . . . . . . . . 33 2·8 Strategy 2 (mixture weight=0.274) πi(0) = [0.1 0.6 0.2 0.1]’ ∀ i ∈ [0, . . . , 9], πi(0) = [0.80 0.12 0.06 0.02]T ∀ i ∈ [10, . . . , 99]. The first 10 objects start with node 6, the remaining 90 start with node 18. . . . . . . . . . . . . . . . . 34 2·9 The 3 pure strategies that correspond to columns 2, 5 and 6 of Table 2.2. The frequency of choosing each of these 3 strategies is controlled by the relative proportion of the mixture weight qc ∈ (0..1) with c ∈ {2, 5, 6}. . . . . . . . . 36 3·1 Illustration of scenario with two partially-overlapping sensors. . . . . . . . . 44 3·2 This figure is the graphical version of Table 3.3 for horizon 3. Simulation results for two sensors with full visibility and detection (X=’empty’, ’car’, ’truck’, ’mil- itary’) using πi(0) = [0.1 0.6 0.2 0.1]T ∀ i ∈ [0..9], πi(0) = [0.80 0.12 0.06 0.02]T ∀ i ∈ [10..99]. There is one bar in each sub-graph for each of the three simula- tion modes studied in this chapter. The theoretical lower bound can be seen in the upper-right corner of each bar-chart. . . . . . . . . . . . . . . . . . . 47 3·3 This figure is the graphical version of Table 3.3 for horizon 4. . . . . . . . . 47 3·4 This figure is the graphical version of Table 3.3 for horizon 6. . . . . . . . . 48 3·5 The 7 visibility groups for the 3 sensor experiment indicating the number of locations in each group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 xiv
  • 17.
    3·6 This figureis the graphical version of Table 3.7 for horizon 3. Situation with no detection but limited visibility (X=’car’, ’truck’, ’military’) using πi(0) = [0.70 0.20 0.10]T ∀ i ∈ [0..99]. There were 7 visibility-groups: 20x001, 20x010, 20x100, 12x011, 12x101, 12x110, 4x111. The 3 bars in each sub-graph are for ‘str1’, ‘str2’, ‘str3’ respectively. The theoretical lower bound can be seen in the upper-right corner of each bar-chart. . . . . . . . . . . . . . . . . . . . . . 52 3·7 This figure is the graphical version of Table 3.7 for horizon 4. . . . . . . . . 52 3·8 This figure is the graphical version of Table 3.7 for horizon 6. . . . . . . . . 53 4·1 An example HMM that can be used for each of the N locations. pa is an arrival probability and pd is a departure probability for the Markov chain. 57 5·1 Depiction of measurement likelihoods for empty and non-empty cells as a func- tion of xk0. √ xk0 gives the mean of the density p(Yk0|Ik = 1). If the cell is empty the observation is always mean 0 (black curve). . . . . . . . . . . . . 74 5·2 Waterfall plot of joint probability p(Yk0|Ik; xk0) for πk0 = 0.50 for xk0 ∈ [0 . . . 20]. This figure shows the increased discrimination ability that results from using higher-energy measurements (separation of the peaks). . . . . . . 75 5·3 Graphic showing the posterior probability πk1 as a function of the initial ac- tion xk0 and the initial measurement value Yk0. This surface plot is for λ = 0.01 and πk0 = 0.20. (The boundary between the high (red) and low (blue) portions of this surface is not straight but curves towards -y with +x.) . . . . . . . . 76 5·4 Cost function boundary (see Eq. 5.14) with λ = 0.011 and πk0 = 0.18. In the lighter region two measurements are made, in the darker region just one. (Note positive y is downwards.) . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 xv
  • 18.
    5·5 The optimalboundary for taking one action or two as a function of (xk0, Yk0) (for the [Bashan et al., 2008] cost function) for λ = 0.01 and πk0 = 0.20. The curves in Fig. 5·6 represent cross-sections through this surface for the 3 x-values referred to in that figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5·6 This figure gives another depiction of the optimal boundary between taking one measurement action or two for the [Bashan et al., 2008] cost function. For all Y (xk0, λ) ≥ 0 two measurements are made (and the highest curve is for the smallest xk0, see Fig. 5·5 for the 3D surface from which these cross-sections were taken). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5·7 Two-factor exploration to determine how the optimal boundary between taking one measurement or two measurements varies for a cell with the parameters (p, λ) where p = πk0 (for the [Bashan et al., 2008] problem cost function). Two measurements are taken in the darker region, one measurement for the lighter region. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5·8 Plot of cost function samples associated with false alarms, missed detections and the optimal choice between false alarms and missed detections (for the Bayes’ cost function). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5·9 This figure shows cost-to-go function samples as a function of the second sensing-action xk1 and the second measurement Yk1 for the Bayes’ cost func- tion. These plots use 1000 samples for Yk1 and 100 for xk1. . . . . . . . . . . 87 5·10 Threshold function for declaring a cell empty (risk of MD) or occupied (risk of FA). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5·11 The 0th stage resource allocation as a function of prior probability. The stria- tions are an artifact of the discretization of resources when looking for optimal xk0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 xvi
  • 19.
    5·12 Total resourceallocation to a cell as a function of prior probability. The point- wise sums of the 0th stage and 1st stage resource expenditures are displayed here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5·13 Cost associated with a cell as a function of prior probability. For the optimal resource allocations, there is a one-to-one correspondence between the cost of a cell and the resource utilized to sense a cell. . . . . . . . . . . . . . . . . 92 5·14 Cost-to-go from πk1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5·15 Optimal stage 1 energy allocations. . . . . . . . . . . . . . . . . . . . . . . 94 5·16 Stage 0 energy allocation versus prior probability . . . . . . . . . . . . . . . 95 6·1 Graphical User Interface (GUI) concept for semi-autonomous search and exploitation strategy game. . . . . . . . . . . . . . . . . . . . . . . . . . 106 A·1 Hyperplanes representing the optimal Value Function (cost framework) for the canonical Wald Problem [Wald, 1945] with horizon 3 (2 sensing opportunities and a declaration) for the equal missed detection and false alarm cost case: FA=MD. . . . . . . . . . . . . . . . . . . . . . . . . . . 119 A·2 Decision-tree for the Wald Problem. This figure goes with Fig. A·1. . . . 119 A·3 Example of 3D hyperplanes for a value function (using a reward formu- lation for visual clarity) for X = {‘military’,‘truck’,‘car’,‘empty’}, S = 1, M = 3 for a horizon 3 problem. The cost coefficients for the non-military vehicles were added together to create the 3D plot. This figure and Fig. A·4 are a mixed-strategy pair. . . . . . . . . . . . . . . . . . . . . . 122 A·4 Example of 3D hyperplanes representing the optimal value function re- turned by Value Iteration. The optimal value is the convex hull of these hyperplanes. This figure and Fig. A·3 are a mixed-strategy pair (see Section 2.3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 xvii
B·1  Sequence diagram of the startup process in column gen.  159
B·2  Sequence diagram of how a sensor plan is constructed in column gen.  162
B·3  Sequence diagram of the update cycle in column gen.  163
B·4  Interface for CSimulator class.  166
B·5  Interface for CVehicle class.  167
B·6  Interface for CGrid and CCell classes.  168
B·7  Interface for CTask class.  169
List of Domain-Specific Abbreviations

EO      Electro-Optical
FOV     Field of View
FTI     Fixed Target Indicator
HRI     Human-Robot Interactions
IR      Infrared
LIDAR   Light Detection and Ranging
MTI     Moving Target Indicator
SAM     Surface-to-Air Missile
SAR     Synthetic Aperture Radar
SM      Sensor Management
SNR     Signal-to-Noise Ratio
UAV     Unmanned Airborne Vehicle
UGV     Unmanned Ground Vehicle
USV     Unmanned Submersible Vehicle
List of Mathematical Abbreviations

AP        Assignment Problem
BB        Branch and Bound
CLF       Closed-Loop Feedback (Control)
DP        Dynamic Program(ming)
FA        False Alarm (Cost)
HMM       Hidden Markov Model
IP        Integer Program(ming)
KL        Kullback-Leibler (Distance)
LP        Linear Program(ming)
MABP      Multi-armed Bandit Problem
MAP       Maximum A Posteriori
MD        Missed Detection (Cost)
MDP       Markov Decision Process
MILP      Mixed Integer Linear Program(ming)
ML        Maximum Likelihood
MPC       Model Predictive Control(ler)
MSE       Mean-Squared Error
OLFC      Open-Loop Feedback Control
PBVI      Point-Based Value Iteration
PG        Policy-Graph (Decision-Tree)
POMDP     Partially Observable Markov Decision Process
PWLC      Piece-wise Linear Convex
RH        Receding Horizon (Control)
RMS       Root Mean Square
SCP       Stochastic Control Problem
SDP       Stochastic Dynamic Program(ming)
TSP       Travelling Salesman Problem
w.l.o.g.  without loss of generality
w.r.t.    with respect to
Chapter 1

Introduction

1.1 Problem Description

Recent improvements in the hardware capabilities of robotic vehicles have led to increasing use of these devices in such applications as defense, search and rescue, homeland security, environmental monitoring and video surveillance. To extend and enhance these applications, new algorithms are required to control the behavior of teams of robots and to allow human operators to monitor and control them. Including a limited amount of human input into the decision-making process allows for more robust performance in accomplishing mission objectives without subjecting humans to unbearable stress and fatigue. In this dissertation, we address a class of problems that models search and exploitation missions by one or more semi- or fully autonomous vehicles (UAVs, USVs or UGVs) with heterogeneous sensing capabilities. We focus on how to control vehicle sensors to discover as many objects as possible within a fixed time-window with limited human guidance. A "mission" is defined as a set of tasks to be accomplished within a known, fixed time-frame, the "mission time", in a fixed-size space, the "mission space".

There are numerous applications for autonomous search and exploitation techniques. In a military setting, UAVs can be tasked to explore a hostile area and identify locations of Surface-to-Air Missile (SAM) sites that are threatening to piloted aircraft. In a search and rescue scenario, robotic vehicles can be used to search millions of square miles of ocean looking for sailors lost at sea. Unmanned vehicles can be used for environmental
monitoring during forest fires or chemical spills. On Mars, rovers are used to explore an area, studying geological features and searching for signs of water and potentially past or present life. Other applications include urban law-enforcement and video-surveillance in airports.

In this dissertation we are interested in making use of imprecise sensors to sense locations in the mission space where little information is known. We seek to exploit this noisy data to infer information about the identity of the objects we have sensed and to determine what to observe next in a near-optimal fashion. The noisy sensors on sensor platforms (robotic vehicles) are subject to resource constraints (e.g., the duty-cycle of a radar or the speed with which a camera can be pointed and focused). Resource constraints can be either cumulative across time or periodic in nature. In general, we assume that sufficient computational power is available to process sensor information as quickly as it can be captured, so processing power is not a significant constraint.

We model the mission space as a set of N locations where each location may contain objects of unknown type or be empty. The dynamic state of this problem is the collection of object types across the mission space at each time. This state is not directly observable, so sensors are used to make noisy observations of these locations. We combine information over time in a Bayesian framework to estimate the conditional probability of the state given past observations, which is a sufficient statistic for control (aka the information state or belief-state). Adaptive SM is then posed as the control problem of determining the locations to examine with the set of available sensor resources at each discrete time, as a function of the belief state, in order to discover as much information as possible about the underlying problem state.

We leverage techniques from stochastic control theory and combinatorial optimization theory to develop near-optimal control policies that adapt to information that is
    3 learned throughout amission. One central theme of this work is to address the question of the optimal balance between search (exploration) versus exploitation: how to charac- terize the optimal balance between spending time learning new information by searching new locations for hitherto unknown objects (exploration) versus spending time making use of the information already available to characterize the identity of objects known to exist in pre-determined locations. This trade-off is a question of how to optimally partition the allocation of a scarce resource across multiple tasks that compete for this resource. This type of optimization problem is challenging because there is a combina- torial action space, an imperfectly observed state space and because we are interested in adaptive feedback strategies (dynamic optimization techniques) that keep track of all possible sensing outcomes and future courses of action. We assume a centralized control architecture. Thus a central controller coordinates the actions of each robot actor that participates in the mission, and the controller has access to all of the information from each of the autonomous vehicles without large communication bottlenecks. The central controller must 1) compute feasible search plans and object follow-up actions and 2) communicate these plans to the autonomous vehicles without undue delays so that each vehicle can execute its part of the plan. Because robots have heterogeneous capabilities, different roles for the vehicles will emerge from a control algorithm that tries to optimally manage the resources of individual vehicles. In the controls community, a large distinction is made between a myopic (aka greedy or short-sighted) control/planning strategy versus a non-myopic (aka far-sighted or opti- mal) control strategy. While it is possible that a myopic strategy could be optimal, this is frequently not the case except in special circumstances. Another distinction is made between an open-loop (non-adaptive) algorithm and a closed-loop (adaptive) algorithm. An open-loop algorithm follows a sequence of actions, whereas a closed-loop algorithm is capable of dynamically changing its actions based on the information collected. This
    4 dissertation focuses onalgorithms that are both non-myopic and adaptive. While these algorithms are generally the most computationally demanding, they have the highest performance. A real-time control algorithm needs planning information at a rate of 1–100 Hz, and only so much calculation can be completed before sensing decisions must be made. Because of the combinatorial nature of the problem, computing every possible place that sensors can look at every possible future time-instant and considering every possible action that each sensor can take at each of these places is hopelessly complicated. We develop algorithms that can rapidly search the decision space to compute desired control actions in reasonable computation time. We make the assumption that the state of each location is statistically independent of the states of other locations, which helps us decompose the Stochastic Control Problem for Sensor Management/Sensor Resource Allocation (SM) into subproblems of a tractable size. Most autonomous systems require human guidance and supervision. In order to explore proper roles for humans and automata in an SM system, we partition all of the tasks necessary for SM into subsets and consider which tasks should be accom- plished by a machine and which tasks should be accomplished by humans. We propose semi-autonomous control algorithms that incorporate human input on a high level and automated (machine) decision-making on a lower level. We discuss multiple candidate models for the best way of coordinating between human and automata with the goal of developing a rubric for the most-important activities that a human actor/operator can perform while interacting with a semi-autonomous system. 1.2 Dissertation Scope and Contributions In this dissertation we develop new algorithms for control of robots performing information- seeking missions in unknown environments. These algorithms can be used to control
multiple, multi-modal, resource-constrained sensors on autonomous vehicles by specifying where and how they should be used to maximize one of several possible performance metrics for the mission. The goals of the mission are application-dependent, but at the very least they will include accurately locating, observing and classifying objects in the mission space.

The SM control problem can be formulated as a Partially Observed Markov Decision Problem (POMDP), but its exact solution requires excessive computation. Instead, we use Lagrangian Relaxation techniques to decompose the original SM problem hierarchically into POMDP subproblems with coupling constraints and a master problem that coordinates the prices of resources. To this end, this dissertation makes the following contributions:

• Develops a Column Generation algorithm that creates mixed strategies for SM and implements this algorithm in fast, C language code. This algorithm creates sensor plans that near-optimally solve an approximation of the original SM problem, but generates mixed strategies that may not be feasible. The output strategies are programmatically visualized in MATLAB.

• Develops alternatives for receding horizon control using these approximate, mixed strategies output from the Column Generation routine, and evaluates their performance using a fractional factorial design of experiments based on the above software.

• Extends previous results in SM for classification of stationary objects to allow Markov dynamics and time-varying visibility, obtaining new lower bounds characterizing achievable performance for these problems. These lower bounds can be used to develop receding horizon control strategies.

• Develops new approaches for solution of dynamic search problems with continuous action and observation spaces that are much faster than previous optimal results, with near-optimal performance. We perform simulations of our algorithms in MATLAB and compare our results with those of the optimal algorithm from [Bashan et al., 2007, Bashan et al., 2008]. Our algorithm performs similarly to theirs but can be used for problems with non-uniform priors.

• Designs a game to explore human supervisory control of automata controlled by our
algorithms, in order to explore the effects of different levels of information support on the quality of supervisory control.

1.3 Dissertation Organization

The structure of the remainder of this dissertation is as follows: Chapter 2 is devoted to presenting a literature survey and background material that is pertinent to the SM problem. Chapter 2 also reviews theory from [Castañón, 2005b] that underlies the theoretical foundations of this dissertation's results. Chapter 3 builds upon the results of Chapter 2 and discusses a RH algorithm for near-optimal SM in a scenario where objects have static state and sensor platforms are unconstrained in terms of where they look and when (no motion constraints). Chapter 4 discusses new algorithms for two extensions to the problem formulation of Chapter 3: 1) objects that can arrive and depart with a Markov birth-death process or 2) object visibility that is known but time-varying. Chapter 5 considers an adaptive search problem for sensors with continuous action and observation spaces and presents fast, near-optimal algorithms for the solution of these problems. Chapter 6 discusses some candidate strategies for mixed human/non-human, semi-autonomous robotic search and exploit teams and develops a game that can be used to explore human supervisory control of robotic teams. Chapter 7 summarizes this dissertation. Last of all, two appendices are included that provide some additional background theory and documentation for the simulator discussed in Chapter 3.
Chapter 2

Background

This chapter provides both a literature survey and background material that will be referred to in later chapters. First we review existing techniques from various fields related to this dissertation. We describe why the algorithms presented in the literature fail to address the problem we envision to the extent that is required for a search and exploitation system to be considered "semi-autonomous", which we take to mean an Autonomous Capability Level (ACL) of 6 or higher on the DoD's "UAV Roadmap 2009" [United States Department of Defense, 2009]. Section 2.2 discusses the development of a lower bound for the achievable performance of a SM system from [Castañón, 2005b]. The last section in this chapter, Section 2.3, gives an example of our algorithm for computing SM plans. The implementation of the algorithm from [Castañón, 2005b], as demonstrated by this example, is the first contribution of this dissertation. For the interested reader, a brief review of theory pertaining to Partially Observable Markov Decision Processes (POMDPs), the Witness Algorithm and Point-Based Value Iteration (PBVI), Dantzig-Wolfe Decompositions and Column Generation is available in Appendix A.

2.1 Literature Survey

Problems of search and exploitation have been considered in many fields such as Search Theory, Information Theory, Multi-armed Bandit Problems (MABPs), Stochastic Control, and Human-Robot Interactions. We will review relevant results in each of these
areas in the rest of this section.

2.1.1 Search Theory

One of the earliest examples of Sensor Management (SM) arose in the context of Search, with application to anti-submarine warfare in the 1940s [Koopman, 1946, Koopman, 1980]. In this context, Search Theory was used to characterize the optimal allocation of search effort to look for a single stationary object with a single imperfect sensor. Sensors had the ability to move spatially and allocate their search effort over time and space. In [Koopman, 1946] a worst-case search-performance rule is derived yielding the "Random Search Law", aka the "Exponential Detection Function" [Stone, 1977]. This work is extended in [Stone, 1975] to handle the case of a single moving object. A survey of the field of Search Theory is given by [Benkoski et al., 1991], which describes how most work in this domain focuses on open-loop search plans rather than feedback control of search trajectories. The main problem with most Search Theory results is that the search strategies are non-adaptive and the search ends after the object has been found. Extensions of Search Theory to problems requiring adaptive feedback strategies have been developed in some restricted contexts [Castañón, 1995] where a single sensor takes one action at a time.

Recent work on Search has focused on deterministic control of search vehicle trajectories using different performance metrics. Baronov et al. [Baillieul and Baronov, 2010, Baronov and Baillieul, 2010] describe an information acquisition algorithm for the autonomous exploration of random, continuous fields in the context of environmental exploration, reconnaissance and surveillance. Our focus in this thesis is on adaptive sensor scheduling based on noisy observations, and not on control of sensor-platform trajectories.
2.1.2 Information Theory for Adaptive Sensor Management

Adaptive SM has its roots in the field of statistics, in which Bayesian experiment design was used to configure subsequent experiments that were based on observed information. Wald [Wald, 1943, Wald, 1945] considered sequential hypothesis testing with costly observations. Lindley [Lindley, 1956] and Kiefer [Kiefer, 1959] expanded the concepts to include variations in potential measurements. Chernoff [Chernoff, 1972] and Fedorov [Fedorov, 1972] used Cramér-Rao bounds for selecting sequences of measurements for nonlinear regression problems. Most of the strategies proposed for Bayesian experiment design involve single-step optimization criteria, resulting in "greedy" (or "myopic") strategies that optimize bounds on the expected performance after the next experiment. Athans [Athans, 1972] considered a two-point boundary value approach to controlling the error covariance in linear estimators by choosing the measurement matrices. Other approaches to adaptive SM using single-stage optimization have been proposed with alternative information theoretic measures [Schmaedeke, 1993, Schmaedeke and Kastella, 1994, Kastella, 1997, Kreucher et al., 2005].

Most of the work on information theory approaches for SM focuses on tracking objects using linear or nonlinear estimation techniques [Kreucher et al., 2005, Wong et al., 2005, Grocholsky, 2002, Grocholsky et al., 2003] and uses myopic (single-stage) policies. Myopic policies generated by entropy-gain criteria perform well in certain scenarios, but they have no guarantees for optimality in dynamic optimization problems. Along these lines, the dissertation by Williams provides a set of performance bounds on greedy algorithms as compared to optimal closed-loop policies in certain situations [Williams, 2007].
2.1.3 Multi-armed Bandit Problems

In the 1970s, Gittins [Gittins, 1979] developed an optimal indexing rule for "Multi-armed Bandit Problems" (MABP) that is applicable to SM problems. In these approaches, different objects are modeled as "bandits" and assigning a sensor to look at an object is equivalent to playing the "bandit", thereby changing the "bandit's" state. Krishnamurthy et al. [Krishnamurthy and Evans, 2001a, Krishnamurthy and Evans, 2001b] and Washburn et al. [Washburn et al., 2002] use MABP models to obtain SM policies for tracking moving objects. The MABP model limits their work to policies that use a single sensor with a single mode, so only one object can be observed at a time.

2.1.4 Stochastic Control

Stochastic control approaches to SM problems are often posed as Stochastic Control Problems and solved using Dynamic Programming techniques [Bertsekas, 2007]. Evans and Krishnamurthy [Krishnamurthy and Evans, 2001a] use a Hidden Markov Model (HMM) to represent object dynamics while planning sensor schedules. Using a Stochastic Dynamic Programming (SDP) approach, optimal policies are found for the cost functions studied. While the proposed algorithm provides optimal sensor schedules for multiple sensors, it only deals with one object.

Several authors have recently proposed approximate Stochastic Dynamic Programming techniques for SM based on value function approximations or reinforcement learning [Wintenby and Krishnamurthy, 2006, Kreucher and Hero, 2006, Washburn et al., 2002, Schneider et al., 2004, Williams et al., 2005, Chong et al., 2008a, Chong et al., 2008b]. The majority of these results are focused on the problem of tracking objects. Furthermore, the proposed approaches are focused on small numbers of objects, and fail to address the range and scale of the problems of interest in this dissertation. A good overview of approximate DP techniques is available in [Castañón and
    11 Carin, 2008]. Bashan, Reichand Hero [Bashan et al., 2007, Bashan et al., 2008] use DP to solve a class of two-stage adaptive sensor allocation problems for search with large numbers of possible cells. The complexity of their algorithm restricts its application to problem classes where every cell has a uniform prior. Similar results were obtained for an imaging problem in [Rangarajan et al., 2007]. In this thesis, we develop a different approach that overcomes this limitation in Ch. 5. In [Yost and Washburn, 2000], Yost describes a hierarchical algorithm for resource allocation using a Linear Program (LP) at the top level (the “master problem”) to coordinate a set of POMDP subproblems in a Battle Damage Assessment (BDA) setting. This work is similar to the approach of this dissertation, except we are concerned with more complicated POMDP subproblems. The problem of unreliable resource allocation is discussed in [Casta˜n´on and Wohletz, 2002,Castanon and Wohletz, 2009] in which a pool of M resources is assigned to complete N failure-prone tasks over several stages using an SDP formulation. Casta˜n´on proposes a receding-horizon control approach to solve a relaxed DP problem that has an execution time nearly linear in the number of tasks involved, however this work does not handle a partially observable state. Most approaches for dynamic feedback control are limited in application to problems with a small number of sensor-action choices and simple constraints because the algo- rithms must enumerate and evaluate the various control actions. In [Casta˜n´on, 1997], combinatorial optimization techniques are integrated into a DP formulation to obtain approximate SDP algorithms that extend to large numbers of sensor actions. Subsequent work in [Casta˜n´on, 2005b] derives an SDP formulation using partially observed Markov decision processes (POMDPs) and obtains a computable lower bound to the achievable performance of feedback strategies for complex multi-sensor, SM problems. The lower
    12 bound is obtainedby a convex relaxation of the original combinatorial POMDP using mixed strategies and averaged constraints. However, the results in [Casta˜n´on, 2005b] do not specify algorithms with performance close to the lower bound (see Section 2.2). This dissertation describes such an algorithm in Ch. 3 and then proposes theoretical extensions to this algorithm in Ch. 4. 2.1.5 Human-Robot Interactions and Human Factors The use of simulation as a technique to explore the best means of Human-Robot Inter- action (HRI) in teams with multiple robots per human is the subject of [Dudenhoeffer et al., 2001]. Questions of human situational awareness, mode awareness (what the robot is currently doing), and mental model formulation are discussed. In [Raghunathan and Baillieul, 2010], a search game involving the identification of roots of random polynomials is presented. The paper analyses the search versus exploration trade-off made by players and develops Markov models that emulate the style of play of the 18 players involved in the experiments indistinguishably w.r.t. a performance metric. The SAGAT tool for measuring situational awareness has gained wide acceptance in the literature [Endsley, 1988]. This tool is important for estimating an operator’s ability to adequately control a team of robots and avoid mishaps. In the M.S. thesis of [Anderson, 2006], various hierarchical control structures are described for the human control of multiple robots. A game of tag is played inside a maze by two teams of three robots controlled at 5 levels of autonomy, and various metrics for human and robot performance are studied. Robots in this work are either tele-operated or move myopically, and sensor measurements are noiseless within a certain range. In [Scholtz, 2002], possible HRI interactions are divided up into three categories, and
it is speculated that each category of interaction requires different types of information and a different interface. This work suggests that a system with multiple levels of autonomy requires different kinds of interfaces according to its mode of operation and needs to be able to transition between them without confusing the human operator.

In [Cummings et al., 2005], Cummings discusses a list of issues that need to be addressed to achieve the military's vision for Network Centric Warfare (NCW). The author states that to improve system performance, systems must move from a paradigm of Management by Consent (MBC) to Management by Exception (MBE). A system for predicting human performance in tasking multiple robotic vehicles is discussed in [Crandall and Cummings, 2008]. Human behavior is predicted by generating several stochastic models for 1) the amount of time humans need to issue commands and 2) the amount of time humans need to switch between tasks. Several performance metrics are also presented for situational awareness, for the effectiveness of an operator's communications with a robot, and for the success of robot behavior while untasked.

These references for HRI focus on human situational awareness, performance metrics, and various control strategies for human control of automata in simple environments. These references do not investigate the question of an optimal means of HRI in a semi-autonomous system in which robots with noisy, resource-constrained sensors are used to explore an unknown, partially-observable and dynamic environment using non-myopic and adaptive search strategies.

2.1.6 Summary of Background Work

As the above discussion indicates, the research to date has focused on only parts of the problem of interest in this dissertation. The methods used in the existing body of research need to be merged and unified in an intelligent fashion such that a semi-autonomous search, plan, and execution system is created that behaves cohesively and
in a non-myopic, adaptive fashion.

In this dissertation, we develop and implement algorithms for the efficient computation of adaptive SM strategies for complex problems involving multiple sensors with different observation modes and large numbers of potential object locations. The algorithms we present are based on using the lower bound formulation from [Castañón, 2005b] as an objective in a RH optimization problem and on developing techniques for obtaining feasible sensing actions from (generally infeasible) mixed strategy solutions. These algorithms support the use of multiple, multi-modal, resource-constrained, noisy sensors operating in an unknown environment in a search and classification context. The resulting near-optimal, adaptive algorithms are scalable to large numbers of tasks, and suitable for real-time SM.

2.2 Sensor Management Formulation and Previous Results

In this section, we discuss the SM stochastic control formulation and results presented in [Castañón, 2005b] which serve as the starting point for our work in subsequent chapters. We extend the notation of [Castañón, 2005b] to include multiple sensors and additional modes such as search.

2.2.1 Stationary SM Problem Formulation

Assume there are a finite number of locations 1, . . . , N, each of which may have an object with a given type, or which may be empty. Assume that there is a set of S sensors, each of which has multiple sensor modes, and that each sensor can observe one and only one location at each time with a selected mode. This assumption can be relaxed, although it introduces additional complexity in the exposition and the computation.

Let xi ∈ {0, 1, . . . , D} denote the state of location i, where xi = 0 if location i is unoccupied, and otherwise xi = k > 0 indicates location i has an object of type k. Let πi(0) ∈ ℜ^(D+1) be a discrete probability distribution over the possible states for
[Figure 2·1 appears here: a 3 × 3 grid of MATLAB "stem" plots, one per location, each showing Pr(xi) over the states xi ∈ {0, 1, 2}, under the title "Representative beliefs for N locations".]

Figure 2·1: Illustrative example set of prior probabilities πi(0) using a MATLAB "stem" plot for the case where N = 9 and D = 2. Assuming objects are classified into 3 types, the Maximum-Likelihood estimate of locations xi with i ∈ {3, . . . , 6, 8, 9} is type 0 (empty). The ML estimate of xi for i ∈ {1, 2, 7} is type 1.

the ith location for i = 1, . . . , N where D ≥ 2. Assume that the random variables xi, i = 1, . . . , N, are mutually independent. If independence is not assumed, then it is possible to learn state information about location i from a measurement of location j (with i ≠ j). Fig. 2·1 shows a set of probability mass functions involving N = 9 locations with D = 2 arranged in a 2D grid.

Let there be s = 1, . . . , S sensors, each of which has m = 1, . . . , Ms possible modes of observation. We assume there is a series of T discrete decision stages where sensors can select which location to measure, where T is large enough so that all of the sensors can use their available resources. At each stage, each sensor can choose to employ one and only one of its modes on a single location to collect a noisy measurement concerning the state xi at that location. Each sensor s has a limited set of locations that it can
observe, denoted by Os ⊆ {1, . . . , N}. A sensor action by sensor s at stage t is a pair:

us(t) = (is(t), ms(t))    (2.1)

consisting of a location to observe, is(t) ∈ Os, and a mode for that observation, ms(t). Sensor measurements by sensor s with mode m at stage t, ys,m(t), are modeled as belonging to a finite set ys,m(t) ∈ {1, . . . , Ls}. The conditional probability of the measured value is assumed to depend on the sensor s, sensor mode m, location i and on the true state at the location, xi, but not on the states of other locations. Denote this conditional probability as P(ys,m(t)|xi, i, s, m). We assume that this conditional probability given xi is time-invariant, and that the random measurements ys,m(t) are conditionally independent of other measurements yσ,n(τ) given the states xi, xj for all sensors s, σ and modes m, n provided i ≠ j or τ ≠ t.

Each sensor has a limited quantity of Rs resources available for measurements over the T stages of time. Associated with the use of mode m by sensor s on location i is a resource cost rs(us(t)) to use this mode, representing power or some other type of resource required to use this mode from this sensor.

Σ_{t=0}^{T−1} rs(us(t)) ≤ Rs    ∀ s ∈ [1, . . . , S]    (2.2)

This is a hard constraint for each realization of observations and decisions. Let I(t) denote the history of past sensing actions and measurement outcomes up to and including stage t − 1:

I(t) = {(us(τ), ys,m(τ)), s = 1, . . . , S; τ = 0, . . . , t − 1}

As is frequently the case when working with POMDPs, we make use of the idea of the information history as a sufficient statistic for the state of the system/world x.
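As a concrete, if simplified, illustration of the hard per-realization budget in Eq. 2.2, the following C sketch checks a realized action sequence against each sensor's budget. All numbers and names (budget, spend) are illustrative; they are not taken from the column gen software.

#include <stdio.h>

#define S_SENSORS 2
#define T_STAGES  4

/*
 * Hard resource constraint of Eq. 2.2: for every realization of actions, each
 * sensor's summed resource expenditure over the T stages must stay within its
 * budget R_s.  The cost table and budgets below are illustrative only.
 */
int main(void)
{
    double budget[S_SENSORS] = {1.0, 0.6};            /* R_s per sensor */
    /* r_s(u_s(t)) for one realized plan: rows are sensors, columns are stages. */
    double spend[S_SENSORS][T_STAGES] = {
        {0.10, 0.18, 0.18, 0.10},
        {0.10, 0.10, 0.18, 0.10},
    };

    for (int s = 0; s < S_SENSORS; ++s) {
        double used = 0.0;
        for (int t = 0; t < T_STAGES; ++t)
            used += spend[s][t];
        printf("sensor %d: used %.2f of %.2f -> %s\n", s, used, budget[s],
               used <= budget[s] ? "feasible" : "infeasible");
    }
    return 0;
}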
Under the assumption of conditional independence of measurements and independence of individual states at each location, the joint probability π(t) = P(x1 = k1, x2 = k2, . . . , xN = kN | I(t)) can be factored as the product of belief-states (marginal conditional probabilities) for each location. Denote the conditional probability (belief-state) at location i as πi(t) = p(xi|I(t)). The probability vector π(t) is a sufficient statistic for all information that is known about the state of the N locations up until time t.

When a sensor measurement is taken, the belief-state π(t) is updated according to Bayes' Rule. A measurement of location i with the sensor-mode combination us(t) = (i, m) at stage t that generates observable ys,m(t) updates the belief-vector as:

πi(t + 1) = diag{P(ys,m(t)|xi = j, i, s, m)} πi(t) / ( 1^T diag{P(ys,m(t)|xi = j, i, s, m)} πi(t) )    (2.3)

where 1 is the D + 1 dimensional vector of all ones. Eq. 2.3 captures the relevant information dynamics that SM controls. For generality, the index i in the likelihood function specifies that the sensor statistics could vary on a location-by-location basis. Of prime importance is the fact that using π(t) as a sufficient statistic along with Eq. 2.3, we are able to combine a priori probabilities represented by π(t) with conditional probabilities given by sensor measurements in order to form posterior probabilities, π(t + 1), in recursive fashion: beliefs can be maintained and propagated.

In addition to information dynamics, there are resource dynamics that characterize the available resources at stage t. The dynamics for sensor s are given as:

Rs(t + 1) = Rs(t) − rs(us(t));    Rs(0) = Rs    (2.4)

These dynamics constrain the admissible decisions by a sensor, in that it can only use modes that do not use more resources than are available.
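The belief update of Eq. 2.3 is simple to implement for a single location. The following C sketch applies one Bayes' Rule step to a location's belief vector; the numbers are illustrative (loosely patterned on the 'search'-mode statistics used later), not a fragment of the column gen code.

#include <stdio.h>

#define D_PLUS_1 3   /* number of possible states at a location (D = 2 plus 'empty') */

/*
 * Bayes' Rule update of Eq. 2.3 for a single location.
 * belief[j]     : prior probability pi_i(t) that the location is in state j
 * likelihood[j] : P(y | x_i = j, i, s, m) for the observation y that was received
 * The posterior overwrites belief[] in place.
 */
static void bayes_update(double belief[D_PLUS_1], const double likelihood[D_PLUS_1])
{
    double normalizer = 0.0;               /* denominator 1^T diag{P(y|x)} pi_i(t) */
    for (int j = 0; j < D_PLUS_1; ++j) {
        belief[j] *= likelihood[j];        /* numerator diag{P(y|x)} pi_i(t)       */
        normalizer += belief[j];
    }
    for (int j = 0; j < D_PLUS_1; ++j)
        belief[j] /= normalizer;           /* renormalize to a probability vector  */
}

int main(void)
{
    /* Prior for one location and the likelihood column of the observation received. */
    double belief[D_PLUS_1]     = {0.80, 0.12, 0.08};
    double likelihood[D_PLUS_1] = {0.08, 0.92, 0.92};   /* e.g., a detection under a search-type mode */

    bayes_update(belief, likelihood);
    printf("posterior: %.4f %.4f %.4f\n", belief[0], belief[1], belief[2]);
    return 0;
}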
An adaptive feedback strategy is a closed-loop policy or decision-making rule that maps collected information sets up until stage t, i.e. the sets Ii(τ) ∀ i ∈ [1, . . . , N], τ ∈ [0, . . . , t − 1], to choose actions for stage t: γ : I(t) → U(t). Define a local strategy, γi, as an adaptive feedback strategy that chooses actions for location i purely based on the information sets Ii(t), which is to say based purely on the history of past actions and observations specific to location i.

Given the final information, I(T), the quality of the information collected is measured by making an estimate of the state of each location i given the available information. Denote these estimates as vi ∀ i = 1, . . . , N. The Bayes' cost of selecting estimate vi when the true state is xi is denoted as c(xi, vi) ∈ ℜ with c(xi, vi) ≥ 0. The objective of the SM stochastic control formulation is to minimize:

J = Σ_{i=1}^{N} E[c(xi, vi)]    (2.5)

by selecting adaptive sensor control policies and final estimates subject to the dynamics of Eq. 2.3 and the constraints of Eq. 2.2 and Eq. 2.4.

This problem was solved using dynamic programming in [Castañón, 2005a] as follows: Define the optimal value function V(π, R, t) as the optimal solution to Eq. 2.5 subject to Eq. 2.2 and Eq. 2.4 when the initial information is π, and R is the vector of current resource levels. Let R = [R1 R2 . . . RS]^T. Define πij as the jth component of the probability vector associated with location i. The value function is defined on S^N × ℜ^S_+. Let U represent the set of all possible sensor actions and define U(R) ⊂ U as the set of feasible actions with resource level R. The value function V(π, R, t) must be recursively related to V(π, R, t + 1), its value one time step earlier, according to Bellman's Equation [Bellman, 1957]:

V(π, R, t + 1) = min{ Σ_{i=1}^{N} min_{vi∈X} Σ_{j=0,...,D} c(j, vi) πij ,  min_{u∈U(R)} E_y{ V(T(π, u, y), R − Ru, t) } }    (2.6)
where Ru = [r1(u1(t)) r2(u2(t)) . . . rS(uS(t))]^T and T(·) is an operator describing the belief dynamics. T is the identity mapping for information states πj(t) ∀ j except for the set of sensed locations, {j | j = is(t) for some s ∈ [1, . . . , S]}. For a sensed location i, T maps πi(t) to πi(t + 1) with Eq. 2.3. The expectation in Eq. 2.6 is given by:

E_y{ V(T(π, u, y), R − Ru, t) } = Σ_{y ∈ Ys,m ∀ s} P(y | I(t), u) V(T(π, u, y), R − Ru, t)

where Ys,m is the (discrete) set of possible observations (symbols) for sensor s with mode m, and y is a vector of measurements with one measurement per sensor. (The mode m for each sensor s is determined by the vector-action u in the minimization.) The minimization is done over S dimensions because there are S sensors.

To initialize the recursion, the optimal value function when the number of stages to go t is zero is determined by choosing the classification decision vi without any additional measurements, as

V(π, R, 0) = Σ_{i=1}^{N} min_{vi∈X} Σ_{j=0,...,D} c(j, vi) πij    (2.7)

Note that this minimization can be done independently for each location i. The optimal value of Eq. 2.5 can be computed using Eq. 2.6 – Eq. 2.7.

The problem with the DP equation Eq. 2.6 as it currently stands is that whereas the measurement and classification costs of the N locations in the problem initially start off decoupled from each other (cf. Eq. 2.7), the DP recursion does not preserve the decoupling from one stage to the next. Therefore in general the best choice of action for location i with t stages-to-go will depend on the amount of resources from each of the different sensors that have been expended on other locations during the previous stages. This leads to a very large POMDP problem with a combinatorial number of actions to consider and an underlying belief-state of dimension (D + 1)^N that is computationally intractable unless there are few locations.
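The terminal stage of the recursion, Eq. 2.7, is just a minimization over declarations carried out independently per location. A minimal C sketch of that per-location minimization follows, under the assumption of a simple 0/1 Bayes cost matrix; the numbers are illustrative only.

#include <float.h>
#include <stdio.h>

#define D_PLUS_1 3   /* states 0..D at a location */
#define N_DECLS  3   /* candidate declarations v_i (here one per state) */

/*
 * Terminal cost of Eq. 2.7 for a single location: with no measurements left,
 * the best declaration v_i minimizes the expected Bayes cost
 *     sum_j c(j, v_i) * pi_ij.
 * Because Eq. 2.7 is a sum of such terms, the full V(pi, R, 0) is this value
 * summed over the N locations.
 */
static double terminal_cost(const double belief[D_PLUS_1],
                            const double cost[D_PLUS_1][N_DECLS],
                            int *best_decl)
{
    double best = DBL_MAX;
    for (int v = 0; v < N_DECLS; ++v) {
        double expected = 0.0;
        for (int j = 0; j < D_PLUS_1; ++j)
            expected += cost[j][v] * belief[j];
        if (expected < best) { best = expected; *best_decl = v; }
    }
    return best;
}

int main(void)
{
    /* Unit FA/MD costs: declaring v when the true state is j costs 1 if they differ. */
    double cost[D_PLUS_1][N_DECLS] = {{0,1,1},{1,0,1},{1,1,0}};
    double belief[D_PLUS_1] = {0.80, 0.12, 0.08};
    int v;
    double J0 = terminal_cost(belief, cost, &v);
    printf("declare state %d, expected cost %.3f\n", v, J0);
    return 0;
}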
In [Castañón, 2005b], the above problem is replaced by a simpler problem that provides a lower bound on the optimal cost, by expanding the set of admissible strategies and replacing the constraints of Eq. 2.2 with "soft" constraints:

E[ Σ_{t=0}^{T−1} rs(us(t)) ] ≤ Rs    ∀ s ∈ [1, . . . , S]    (2.8)

Note that every admissible strategy that satisfies Eq. 2.2 also satisfies Eq. 2.8. After relaxing the resource constraints, there is just one constraint per sensor (instead of one constraint for every possible realization of actions and observations per sensor). These are constraints on the average resource use one would expect over the planning horizon.

To solve the relaxed problem, [Castañón, 2005b] proposed incorporating the soft constraints in Eq. 2.8 into the objective function using Lagrange multipliers λs for each sensor s and Lagrangian Relaxation. Now the measurement and classification costs for a pair of locations are only related through the values of the Lagrange multipliers associated with the sensors they use in common. Therefore, given the price of time for the set of sensors that will be used in the optimal policy to make measurements on a pair of locations, the classification and measurement costs for those two locations are decoupled in expectation! Once we can partition resources between a pair of locations, we can do so for N locations. The augmented objective function is:

J̄λ = J + Σ_{t=0}^{T−1} Σ_{s=1}^{S} λs E[rs(us(t))] − Σ_{s=1}^{S} λs Rs    (2.9)

Define an admissible strategy, γ, as a function which maps an information state, π(t), to a feasible measurement action (or to a null action if sufficient resources are unavailable). Define Γ as the set of all possible γ. Because the measurements and possible sensor actions are finite-valued, the set of possible SM strategies Γ is also finite.
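One way to see the effect of Eq. 2.9 inside a single location's subproblem is that each sensing action simply acquires a priced cost λs·rs(u) in addition to the eventual classification cost. The C sketch below shows that pricing step; the struct fields, cost table and prices are illustrative assumptions and not definitions from the column gen code.

#include <stdio.h>

#define S_SENSORS 2

/* Illustrative action descriptor: which sensor, which mode, which location. */
struct sensor_action { int sensor; int mode; int location; };

/* r_s(u): resource expended by this action; a made-up two-mode cost table. */
static double resource_cost(const struct sensor_action *u)
{
    static const double mode_cost[2] = {0.10, 0.18};   /* 'Mode1', 'Mode2' */
    return mode_cost[u->mode];
}

/*
 * After the relaxation of Eq. 2.8, the hard resource budget is replaced inside
 * each location's subproblem by the priced term lambda_s * r_s(u) from Eq. 2.9,
 * so two locations interact only through the prices of the sensors they share.
 */
static double priced_action_cost(const struct sensor_action *u, const double lambda[S_SENSORS])
{
    return lambda[u->sensor] * resource_cost(u);
}

int main(void)
{
    double lambda[S_SENSORS] = {0.7, 1.3};              /* dual prices from the master LP */
    struct sensor_action u = {0, 1, 42};                /* sensor 0, 'Mode2', location 42 */
    printf("priced cost = %.3f\n", priced_action_cost(&u, lambda));
    return 0;
}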
Let Q(Γ) denote the set of mixed strategies that assign probability q(γ) to the choice of strategy γ ∈ Γ. A key result in [Castañón, 2005b] was that when the optimization of Eq. 2.9 was done over mixed strategies for given values of the Lagrange multipliers, λs, the optimization problem in Eq. 2.9 decoupled into independent POMDPs for each location, and the optimization could be performed using local feedback strategies, γi, for each location i. Write ΓL for the set of all local feedback strategies. These POMDPs have an underlying information state-space of dimension D + 1, corresponding to the number of possible states at a single location, and can be solved efficiently. Decomposition is essential to make the problem tractable.

The pair of figures Fig. 2·2 and Fig. 2·3 demonstrate what a stereotypical POMDP solution (for a toy problem) looks like. These figures describe the optimal set of solution hyperplanes and the optimal policy for SM on a set of locations given a vector of prices for resource costs (i.e. assuming for the moment that we already know what the optimal resource prices are). The brown and magenta hyperplanes (nodes 2 and 6 w.r.t. Fig. 2·3) are very nearly parallel to the neighboring hyperplanes; therefore two of the three hyperplanes with node ids 1–3 are very nearly redundant (dominated), and the same goes for node ids 5–7. The smaller the extent of a hyperplane in the concave (for cost functions) hull of the set of hyperplanes, the less role it has to play in the optimal value function. In this example the cost of the 'Mode1' action was 0.1 units and that of 'Mode2' was 0.18 units. If the 'Mode2' cost is changed to 0.2 units, then there are only 7 hyperplanes in the optimal set of hyperplanes (i.e. the value function) and 'Mode2' is not used at all. These results are relative to the prior probability and sensor statistics. The alpha vectors (i.e. hyperplane coefficients) and actions associated with each hyperplane (equivalently, decision-tree node) can be seen in the inset below the value function. The state enumeration was X = {'non-military', 'military'}.
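To make the hyperplane representation concrete, the following C sketch evaluates a piecewise-linear (cost) value function stored as alpha vectors and reads off the optimal action for a belief. The alpha vectors and action labels below are invented for illustration; they are not the coefficients shown in Fig. 2·2.

#include <float.h>
#include <stdio.h>

#define D_PLUS_1 2     /* belief dimension for X = {'non-military','military'} */
#define N_PLANES 3     /* number of hyperplanes kept in this toy value function */

/*
 * For a cost formulation the value at a belief is the minimum over hyperplanes
 * of alpha . belief, and the minimizing hyperplane's action is the optimal
 * action at that belief.
 */
static const double alpha[N_PLANES][D_PLUS_1] = {
    {0.00, 1.00},   /* declare 'non-military' (H1): cost = probability of MD  */
    {1.00, 0.00},   /* declare 'military' (H2):     cost = probability of FA  */
    {0.35, 0.45},   /* take another 'Mode1' measurement (priced cost-to-go)   */
};
static const char *action_of[N_PLANES] = {"declare H1", "declare H2", "Mode1"};

static double value_at(const double belief[D_PLUS_1], int *best_plane)
{
    double best = DBL_MAX;
    for (int k = 0; k < N_PLANES; ++k) {
        double v = 0.0;
        for (int j = 0; j < D_PLUS_1; ++j)
            v += alpha[k][j] * belief[j];
        if (v < best) { best = v; *best_plane = k; }
    }
    return best;
}

int main(void)
{
    double belief[D_PLUS_1] = {0.5, 0.5};
    int k;
    double v = value_at(belief, &k);
    printf("V = %.3f, optimal action: %s\n", v, action_of[k]);
    return 0;
}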
The alpha vector coefficients give the classification + measurement cost of a location having each one of these states w.r.t. this enumeration.

In Fig. 2·3, assume hypothesis 'H2' corresponds to 'Declare military vehicle' and 'H1' is the null hypothesis ('Declare non-military vehicle'). In this policy the arrows leaving each node on top represent observation 'y1' ('non-military'), and the arrows on the bottom represent 'y2' ('military'). The 9 nodes on the left of this policy correspond to the 9 hyperplanes that make up Fig. 2·2. If there had been more than two possible actions and two possible observations, then after a few stages there could easily have been thousands of distinct nodes in the initial stage! This figure uses a model with a dummy/terminal capture state, so it is possible to stop sensing at any time. The use of two states, one sensor with two modes, and two observations (the same type of observations for both modes) for a horizon 5 (4 sensing actions + classification) POMDP results in 9 hyperplanes based on the particular cost structure used: 'Mode1' costs 0.1 units and 'Mode2' costs 0.18 units. False alarms (FAs) and missed detections (MDs) each cost 1 unit. For problems with several sensors, 4 possible states and 3 modes per sensor with 3 possible observations, there are frequently on the order of 500–1000 hyperplanes for a horizon 5 POMDP.

Whereas originally this work was done using the Witness Algorithm in a modified version of pomdp-solve-5.3 [Cassandra, 1999], this algorithm is slow when solving thousands (later millions) of POMDPs in a loop. Therefore, by default we use the Finite Grid algorithm (PBVI) within our customized version of pomdp-solve-5.3 with 500–1000 belief-points to solve POMDPs. This allows a POMDP of this size to be solved within about 0.2 sec on a single-core, Intel P4, 2.2 GHz, Linux machine.

There is a trade-off between correctly detecting objects and engendering false alarms. Fig. 2·4 illustrates how the overall classification cost increases as the ratio of the MD:FA cost increases (from 1:1 through 80:1) for 3 resource levels {300, 500, 700} according to
Figure 2·2: Hyperplanes from the Value Iteration algorithm that accompany Fig. 2·3.

Figure 2·3: Policy graphs for the optimal classification of a location with state {'non-military','military'}, two possible actions {'Mode1','Mode2'}, and two possible observations {'y1','y2'}.
two cases: in the first case the resources are all available to a single sensor that supports two modes of operation {'mode1', 'mode2'}, and in the second case the resources are equally divided between two identical sensors that each support the same two modes of operation. Partitioning resources in this way adds an additional constraint that increases classification cost. We also observe that the larger the quantity of resources available, the larger the discrepancy between the S = 1, M = 2 case and the S = 2, Ms = 1 ∀ s case.

In order to have meaningful POMDP solutions, we must have a way of coordinating the sensing activities between the various locations. Lagrange multipliers and Lagrangian Relaxation provide this coordinating mechanism. Writing our policies for SM in terms of mixed strategies allows linear programming techniques to be used for Lagrangian Relaxation. To this end, we write Eq. 2.9 in terms of mixed strategies:

J̃*λ = min_{γ∈Q(ΓL)} E_γ[ Σ_{i=1}^{N} c(xi, vi) + Σ_{t=0}^{T−1} Σ_{s=1}^{S} λs rs(us(t)) ] − Σ_{s=1}^{S} λs Rs    (2.10)

where the strategy γ maps the current information state π(t) to the choice of us(t) ∀ s. At stage T the strategy γ also determines the classification decisions vi ∀ i. Because we replaced the hard resource constraints of Eq. 2.2 with their relaxed form, the actual optimal cost must be lower bounded by Eq. 2.10, since we have expanded the space of feasible actions. This identification leads to the inequality:

J* ≥ sup_{λ1,...,λS ≥ 0} J̃*_{λ1,...,λS}    (2.11)

As shown in [Castañón, 2005a], Eq. 2.11 is the dual of the LP:

min_{q∈Q(ΓL)} Σ_{γ∈ΓL} q(γ) E_γ[J(γ)]    (2.12)
Figure 2·4: This figure is a plot of expected cost (measurement + classification) versus MD cost for 3 different resource levels. The solid (blue) line gives the performance when the resources are pooled into one sensor and the dashed (red) line gives the performance when the resources are split across two sensors.
subject to:

Σ_{γ∈ΓL} q(γ) E_γ[ Σ_{i=1}^{N} Σ_{t=0}^{T−1} rs(us(t)) ] ≤ Rs    ∀ s ∈ [1, . . . , S]    (2.13)

Σ_{γ∈ΓL} q(γ) = 1    (2.14)

where we have one constraint for each of the S sensor resource pools and an additional simplex constraint in Eq. 2.14 which ensures that q ∈ Q(ΓL) forms a valid probability distribution. This is a large LP, where the possible variables are the strategies in ΓL. However, the total number of constraints is S + 1, which establishes that optimal solutions of this LP are mixtures of no more than S + 1 strategies. Thus, one can use a Column Generation approach [Gilmore and Gomory, 1961, Dantzig and Wolfe, 1961, Yost and Washburn, 2000] to quickly identify an optimal mixed strategy that solves the relaxed (i.e. approximate) form of our SM problem. (See Appendix A.3 for an overview of Column Generation.)

To use Column Generation with the LP formulation Eq. 2.12 – Eq. 2.14, we break the original problem hierarchically into two new sets of problems that are called the master problem and subproblems. There is one POMDP subproblem for each location. The master problem consists of identifying the appropriate values of the Lagrange multipliers, λs ∀ s, to determine how resources should be shared across locations, and the subproblems consist of using these Lagrange multipliers to compute the expected resource usage and expected classification cost for each of the N locations. See Fig. 2·5 for a pictorial representation.

Column Generation works by solving Eq. 2.12 and Eq. 2.13, restricting the mixed strategies to be mixtures of a small subset Γ′L ⊂ ΓL. The solution of the restricted LP has optimal dual prices λs, s = 1, . . . , S. Using these prices, one can determine a corresponding optimal pure strategy by minimizing Eq. 2.9, which the results in [Castañón, 2005b] show can be decoupled into N independent optimization problems, one for each location. Each of the subproblems is solved as a POMDP using standard algorithms,
Figure 2·5: Schematic showing how the master problem coordinates the activities of the POMDP subproblems using Column Generation and Lagrangian Relaxation. After the master problem generates enough columns to find the optimal values for the Lagrange multipliers, there is no longer any benefit to violating one of the resource constraints, and the subproblems (with augmented costs) are decoupled in expectation.

such as Point-Based Value Iteration (PBVI) (Appendix A.2) [Pineau et al., 2003], to determine the best pure strategy γ1 for these prices. Solving all of the subproblems allows a new column to be generated by providing values for the expected classification cost and expected resource utilization for a given set of sensor prices λs; these values become the coefficients in the new column in the (Revised) Simplex Tableau of the master problem. The column that is generated will be a pure strategy that is not already in the basis of the LP (or else the master problem would have converged). If the best pure strategy, γ1, for the prices λs ∀ s ∈ [1, . . . , S] is already in the set Γ′L, then the solution of Eq. 2.12 and Eq. 2.13 restricted to Q(Γ′L) is an optimal mixed strategy over all of Q(ΓL), and the Column Generation algorithm terminates. Otherwise, the strategy γ1 is added to the admissible set Γ′L, and the iteration is repeated. The solution to this algorithm is a set of mixed strategies that achieve a performance level that is a lower bound on the original SM optimization problem with hard constraints.
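The overall iteration can be summarized structurally as follows. The C sketch below is only a skeleton of the loop just described: the restricted-LP solve and the POMDP subproblem solves are represented by comments and a placeholder candidate, and names such as solve_restricted_lp, pool and candidate are our own illustrative inventions, not interfaces from the column gen software.

#include <stdio.h>

#define S_SENSORS   2
#define MAX_COLUMNS 64

/*
 * A "column" of the master LP in Eq. 2.12 - Eq. 2.14: one pure strategy,
 * summarized by its expected classification cost and its expected resource
 * usage for each sensor (obtained by tracing the decision-trees of the N
 * POMDP subproblems).
 */
struct column {
    double expected_cost;                 /* sum over locations of E[c(x_i, v_i)] */
    double expected_use[S_SENSORS];       /* per-sensor E[ sum_t r_s(u_s(t)) ]    */
};

/* Two columns represent the same strategy if their coefficients coincide
 * (a tolerance would be used in practice).                                  */
static int same_column(const struct column *a, const struct column *b)
{
    if (a->expected_cost != b->expected_cost) return 0;
    for (int s = 0; s < S_SENSORS; ++s)
        if (a->expected_use[s] != b->expected_use[s]) return 0;
    return 1;
}

int main(void)
{
    struct column pool[MAX_COLUMNS];
    int n_cols = 0;
    double lambda[S_SENSORS] = {0.0, 0.0};     /* dual prices from the restricted LP */

    for (int iter = 0; iter < 100; ++iter) {
        /* 1. Solve the restricted master LP over pool[0..n_cols-1]; this would
         *    update the mixture weights q(gamma) and the dual prices lambda[s].
         *    e.g.  solve_restricted_lp(pool, n_cols, q, lambda);  (hypothetical) */

        /* 2. With prices lambda, solve the N per-location POMDP subproblems
         *    (Eq. 2.9 decoupled) and trace their decision-trees to summarize the
         *    resulting pure strategy as a candidate column.                      */
        struct column candidate = {8.5, {120.0, 95.0}};   /* placeholder numbers  */

        /* 3. Terminate if the candidate is already in the pool; otherwise add it. */
        int found = 0;
        for (int k = 0; k < n_cols; ++k)
            if (same_column(&pool[k], &candidate)) { found = 1; break; }
        if (found) break;
        pool[n_cols++] = candidate;
    }
    printf("columns generated: %d (lambda = %.2f, %.2f)\n", n_cols, lambda[0], lambda[1]);
    return 0;
}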
2.2.2 Addressing the Search versus Exploitation Trade-off

As one contribution of this dissertation, we address how to non-myopically trade off between spending time searching for objects versus spending time acting on them. This is an easy generalization to make: object confusion matrices can be used to allow inferencing based on detection only. To accomplish this, we augment the sensor model used in the example of [Castañón, 2005a] with a 'search' action that supports a low-res mode of operation (with low resource demand) designed for object detection but incapable of object classification. Our sensor observation models can be made non-informative w.r.t. object type by setting the conditional probability of an observation for each type of object to be the same, as in Table 2.1.

                          o1     o2     o3
P(y1,1(t)|xi, u1(t)):
    empty                0.92   0.04   0.04
    car                  0.08   0.46   0.46
    truck                0.08   0.46   0.46
    SAM                  0.08   0.46   0.46

P(y1,2(t)|xi, u2(t)):
    empty                0.95   0.03   0.02
    car                  0.05   0.85   0.10
    truck                0.05   0.10   0.85
    SAM                  0.05   0.10   0.85

P(y1,3(t)|xi, u3(t)):
    empty                0.97   0.02   0.01
    car                  0.02   0.95   0.03
    truck                0.02   0.90   0.08
    SAM                  0.02   0.03   0.95

Table 2.1: Example of an expanded sensor model for an SEAD mission scenario where the states are {'empty', 'car', 'truck', 'SAM'} and the observations are ys,m = {o1 = 'see nothing', o2 = 'civilian vehicle', o3 = 'military vehicle'} ∀ s, m. This setup models a single sensor with modes {u1 = 'search', u2 = 'mode1', u3 = 'mode2'}, where mode2 by definition is a higher-quality mode than mode1. Using mode1, trucks can look like SAMs, but cars do not look like SAMs.

As a simple starting point for the new sensor model, we consider three possible values of observations: {o1 = 'see nothing', o2 = 'uninteresting object', o3 = 'interesting object'} that have known statistics and are the result of pre-processing and thresholding sensor data. The 'search' action effectively returns the probability of the merged detection event, P(o2 ∪ o3 | xi, u1).
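A small C sketch makes the detection-only character of the 'search' mode explicit: every occupied state shares the same likelihood row of Table 2.1, so the only usable quantity is the probability of the merged detection event. The array below restates the 'search' rows of Table 2.1 for illustration.

#include <stdio.h>

#define N_STATES 4   /* 'empty', 'car', 'truck', 'SAM' */
#define N_OBS    3   /* o1 = 'see nothing', o2, o3      */

/* Search-mode likelihoods from Table 2.1 (rows: states, columns: observations). */
static const double search_model[N_STATES][N_OBS] = {
    {0.92, 0.04, 0.04},   /* empty */
    {0.08, 0.46, 0.46},   /* car   */
    {0.08, 0.46, 0.46},   /* truck */
    {0.08, 0.46, 0.46},   /* SAM   */
};

int main(void)
{
    /* The 'search' mode is non-informative w.r.t. object type: only presence or
     * absence can be inferred, via the detection probability P(o2 or o3 | x).   */
    for (int x = 0; x < N_STATES; ++x) {
        double p_detect = search_model[x][1] + search_model[x][2];
        printf("state %d: P(detection) = %.2f\n", x, p_detect);
    }
    return 0;
}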
2.2.3 Tracing Decision-Trees

One hindrance is that the hyperplanes given by a POMDP solver that represent the expected cost-to-go are in terms of total cost. In order to create a new column, it is necessary to separate out the classification cost from the measurement costs. This process is best illustrated with an example. Consider Fig. 2·6. This figure is an illustration of "tracing" or "walking" a POMDP decision-tree solution to calculate expected classification costs and resource utilizations for a subproblem. The states in this example are indexed as {'military', 'truck', 'car', 'empty'}. The dot-product of the probability vector for the current information state, in this example π(0) = [0.02 0.06 0.12 0.80]^T, with the best hyperplane returned by Value Iteration (or the approximation PBVI) gives the total cost for location i: Ji,total = Ji,measure + Ji,classify, from which we can calculate the subproblem classification cost, Ji,classify, once we subtract out its measurement cost, Ji,measure. To subtract out the measurement cost, we must recursively traverse the decision-tree and sum up the (expected) cost of each potential measurement action. The probability mixture weights in the expected cost of each action are given by the observation probabilities P(ys,m(t)|πi(t), u(t)), where u(t) is the sensor action taken.

For the particular initial probability π(0), only actions in the set {'wait', 'search'} are part of the optimal solution (for simplicity of illustration). The numbers in blue represent the conditional likelihood of an observation occurring, and the color of each node represents the optimal choice of action for that information state (and nearby information states): {white = 'wait', aqua = 'search'}. Given this decision-tree that represents the optimal course of action for the information state π(0), the set of possible future beliefs and the relative likelihood of each belief occurring are shown. The possible beliefs and likelihoods display their respective observation histories up to time t using the convention o = {y(0), . . . , y(t−1)}. In this example false alarm and missed detection costs are equal (FA=MD), and the (time-invariant) likelihoods for the 'search' action are:

P(ysearch(t) | xi, 'search') =
    [ 0.08  0.46  0.46 ]
    [ 0.08  0.46  0.46 ]
    [ 0.08  0.46  0.46 ]
    [ 0.92  0.04  0.04 ]

In this matrix, the states xi vary along the rows, and the observations {'o0', 'o1', 'o2'} (for {'see nothing', 'see non-military vehicle', 'see military vehicle'}) vary across the columns.
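The recursive traversal itself is short. The following C sketch walks a toy decision-tree and accumulates the expected measurement cost Ji,measure; the node layout, costs and observation probabilities are illustrative stand-ins, not the tree of Fig. 2·6 or the column gen data structures.

#include <stdio.h>

#define N_OBS    3      /* observations o0, o1, o2 */
#define MAX_NODE 16

/*
 * Minimal decision-tree node for one POMDP subproblem.  action_cost is the
 * resource/measurement cost of the node's action (0 for 'wait' or a terminal
 * declaration); child[o] is the next node index after observation o, or -1 at
 * a declaration node.  obs_prob[o] is P(o | belief at this node, action),
 * i.e. the role played by the blue numbers in Fig. 2·6.
 */
struct tree_node {
    double action_cost;
    double obs_prob[N_OBS];
    int    child[N_OBS];
};

/* Expected measurement cost: the action cost at this node plus the
 * observation-probability-weighted cost of the subtrees that can follow.     */
static double expected_measurement_cost(const struct tree_node tree[], int node)
{
    if (node < 0)
        return 0.0;                              /* declaration: no further cost */
    double cost = tree[node].action_cost;
    for (int o = 0; o < N_OBS; ++o)
        cost += tree[node].obs_prob[o] *
                expected_measurement_cost(tree, tree[node].child[o]);
    return cost;
}

int main(void)
{
    /* A tiny two-stage tree: 'search' at the root, then either declare (after o0)
     * or 'search' once more and declare (after o1/o2).                           */
    struct tree_node tree[MAX_NODE] = {
        [0] = {0.05, {0.72, 0.14, 0.14}, { 1,  2,  2}},
        [1] = {0.00, {1.00, 0.00, 0.00}, {-1, -1, -1}},   /* declare 'empty'       */
        [2] = {0.05, {0.30, 0.35, 0.35}, {-1, -1, -1}},   /* search, then declare  */
    };
    printf("J_measure = %.4f\n", expected_measurement_cost(tree, 0));
    return 0;
}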
All indices are 0-based. There were 3 observations (which in general implies three child nodes for every node in the decision-tree) for some actions, but the search action has a uniform observation probability over all non-empty observations (all observations except 'o0'), and therefore the latter two future node indices for search nodes (nodes that specify search actions) are always the same. (This keeps the example tractable.) For a 'wait' action, all three future nodes are the same because there is only one possible future belief-state. The green terminal classification ('declaration') node represents the decision that a location contains a benign object ('truck', 'car') and the gray declaration node that the location is 'empty'. The nodes are labeled using the scheme '[nodeId nextNodeId[o0] nextNodeId[o1] nextNodeId[o2]]', so for the root node (stage 0, node 0) the next node will be (stage 1, node 4) if the observation is o0, (stage 1, node 1) if the observation is o1, and again (stage 1, node 1) for observation o2 because a search action cannot discriminate object type. The declaration nodes have no future nodes, which is indicated with 'X' characters.

Notice there are two possible information states (beliefs) for π(2) at nodeId 0 during the second-to-last stage, and therefore the conditional observation probabilities at this node are path-dependent (the nodes represent a convex region of belief-space; they do not represent a unique probability vector). The red star and black box in the figure indicate the two different possible beliefs (and therefore the two different possible sets of observation likelihoods) for this node. The vector πnew(1) represents the future belief-state after one time interval if no action is taken. In other words, the state is non-stationary in this example and in fact an HMM was imposed on the state of each location. The HMM has an arrival probability (chance of leaving the 'empty' state) at each stage of 5%: the probability of a location being empty goes from 80% to 76% in one stage (0.80 × (1 − 0.05) = 0.76), and this probability mass diffuses elsewhere (increases the chance of state 'military'). One caveat w.r.t. the software package we used (an extensively modified version of
Tony Cassandra’s pomdp-solve-5.3 [Cassandra, 1999]) that comes into play with a time-varying state is that an observation at stage t (undesirably) refers to the system state at time t+1. This is the convention used in the robotics community but is not desirable in an SM context: it is anti-causal. There is no difference if the state is stationary. We will pick up this topic of non-stationarity again in the next section and then as one of the main topics of Ch. 4.

Each possible terminal belief-state is indicated along with the associated probability of classification error in the lower-right corner. By way of example, while P(error|π(3; o = 0, 0, 0)) = 0.0052, the expected contribution of this error to the terminal classification cost is even smaller because the likelihood of this particular outcome is the joint probability of the associated observations: P(y0 = 0, y1 = 0, y2 = 0) = P(y0 = 0)P(y1 = 0)P(y2 = 0) = 0.7184 × 0.8567 × 0.8724 (the numbers in blue along this realization of the decision-tree). The variables in the conditioning were suppressed for brevity. Unfortunately, walking the decision-trees to back out classification costs is rather slow (recursive function calls) with large trees, requiring on the order of 15% of the computational time in simulations with horizon 6 plans (PBVI took around 80%); at least this operation is parallelizable, and the PBVI algorithm is parallelizable as well.

As a slightly more complex example of a set of POMDP solutions and what tracing decision-trees entails, consider Fig. 2·7 and Fig. 2·8, which show a pair of decision-trees for a horizon 6 scenario with D = 3 and Ms = 3 (plus a ‘wait’ action). The state ‘empty’ has been added to X and a ‘search’ mode has been added to the action space. The ‘search’ mode is able to quickly detect the presence or absence of an object but is completely unable to specify object type. In addition, an HMM has been used instead of having the state be stationary, so that the model allows for a non-zero probability of object arrivals from one stage to the next. This example uses an object arrival probability of 5% per stage. It is interesting to note the situations in which the optimal strategy is to wait to act versus to gain as much information as possible with the time available.
Figure 2·6: Illustration of “tracing” or “walking” a decision-tree for a POMDP subproblem to calculate expected measurement and classification costs (i.e., to separate the individual costs from the total).
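The recursion that backs the measurement cost out of the hyperplane value can be summarized by the following minimal C sketch (a compile-ready fragment rather than a full program). The node layout is hypothetical — it is not the data structure of the modified pomdp-solve code — and the point is only the recurrence: a declaration (leaf) node contributes nothing, while a measurement node contributes the cost charged for its action (e.g. its priced resource cost) plus the observation-weighted costs of its children, with the child beliefs obtained by the Bayes update.

    /* Minimal sketch of the recursive "tree walk" described above, with a
     * hypothetical node layout.  It accumulates
     *   J_measure(node, pi) = r(u) + sum_y P(y | pi, u) * J_measure(child[y], pi_y)
     * so that J_classify = J_total - J_measure can be backed out of the
     * hyperplane value.  Declaration (leaf) nodes contribute no cost. */
    #include <stddef.h>

    #define NS 4   /* states        */
    #define NO 3   /* observations  */

    typedef struct Node {
        int    is_declaration;     /* leaf: classification decision, no children */
        double action_cost;        /* cost charged for the action at this node   */
        const double (*lik)[NO];   /* P(y | x, u) for the action at this node    */
        struct Node *child[NO];    /* next node for each observation             */
    } Node;

    double expected_measurement_cost(const Node *n, const double pi[NS]) {
        if (n == NULL || n->is_declaration)
            return 0.0;

        double cost = n->action_cost;
        for (int y = 0; y < NO; ++y) {
            /* Mixture weight P(y | pi, u). */
            double py = 0.0;
            for (int x = 0; x < NS; ++x) py += n->lik[x][y] * pi[x];
            if (py <= 0.0) continue;

            /* Bayes update to the child belief, then recurse. */
            double pi_y[NS];
            for (int x = 0; x < NS; ++x) pi_y[x] = n->lik[x][y] * pi[x] / py;
            cost += py * expected_measurement_cost(n->child[y], pi_y);
        }
        return cost;
    }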
Figure 2·7: Strategy 1 (mixture weight = 0.726). πi(0) = [0.1 0.6 0.2 0.1]T ∀ i ∈ [0, . . . , 9], πi(0) = [0.80 0.12 0.06 0.02]T ∀ i ∈ [10, . . . , 99]. The first 10 objects start with node 5, the remaining 90 start with node 43. The notation [i Ni0 Ni1 Ni2] indicates the next node/action from node i as a function of observing the 0th, 1st or 2nd observations respectively.

2.2.4 Violation of Stationarity Assumptions

At first glance, using an HMM state model with arrival probabilities, as in Fig. 2·6, seems to violate our stationarity assumptions that allowed us to decompose the original problem into one in which we have parallel virtual time-lines happening at each location and where we did not need to worry about the sequencing of events between these locations. Given stationarity, the order in which locations are sensed does not matter. We notice that the same thing is true for an HMM with arrival probabilities, because having an object arrive at a location does not influence the optimal choice of sensing actions at that location in the past when the location was empty.
Figure 2·8: Strategy 2 (mixture weight = 0.274). πi(0) = [0.1 0.6 0.2 0.1]T ∀ i ∈ [0, . . . , 9], πi(0) = [0.80 0.12 0.06 0.02]T ∀ i ∈ [10, . . . , 99]. The first 10 objects start with node 6, the remaining 90 start with node 18.

If a location is empty, there is no sensing to be done. An arrival only affects the best choices of sensing action for that location in the future, and we replan every round. Therefore we can still decouple sensing actions across locations when we have arrivals. The problem in developing a model for a time-varying state is how to handle object departures. If an object departs (a location becomes empty), then the best choices of previous actions for that location are affected retroactively.

2.3 Column Generation And POMDP Subproblem Example

In this section we present an example of the Column Generation algorithm and POMDP algorithms discussed previously. In this simple example we consider 100 objects (N = 100), 2 possible object types (D = 2) with X = {‘non-military vehicle’, ‘military vehicle’}, and
2 sensors that each have one mode (S = 2 and Ms = 1 ∀ s ∈ {1, 2}). Sensor s actions have resource costs rs, where r1 = 1, r2 = 2. Sensors return 2 possible observation values, corresponding to binary object classifications, with likelihoods:

P(y1,1(t)|xi, u1(t)) = [ 0.90  0.10 ]
                       [ 0.10  0.90 ]

P(y2,1(t)|xi, u2(t)) = [ 0.92  0.08 ]
                       [ 0.08  0.92 ]

where the (j, k) matrix entry denotes the likelihood that y = k if xi = j. The second sensor has 2% better performance than the first sensor but requires twice as many resources to use. Each sensor has Rs = 100 units of resources and can view each location. Each of the 100 locations has a uniform prior of πi = [0.5 0.5]T ∀ i. For the performance objective, we use c(xi, vi) = 1 if xi ≠ vi, and 0 otherwise, so the cost is 1 unit for a classification error.

Table 2.2 demonstrates the Column Generation solution process. The first three columns are initialized by guessing values of resource prices and obtaining the POMDP solutions, yielding expected costs and expected resource use for each sensor at those resource prices. A small LP is solved to obtain the optimal mixture of the first three strategies γ1, . . . , γ3, and a corresponding set of dual prices. These dual prices are used in the POMDP solver to generate the fourth column γ4, which yields a strategy that is different from that of the first 3 columns. The LP is re-solved for mixtures of the first 4 strategies, yielding new resource prices that are used to generate the next column. This process continues until the solution using the prices after 7 columns yields a strategy that was already represented in a previous column, terminating the algorithm. The optimal mixture combines the strategies of the second, fifth and sixth columns. When the master problem converges, the optimal cost, J∗, for the mixed strategy is 5.95 units. The resulting decision-trees are illustrated in Fig. 2·9, where branches up indicate measurements y = 1 (‘non-military’) and down y = 2 (‘military’).
                  γ1       γ2      γ3      γ4      γ5      γ6      γ7
min               50.0     2.80    2.44    1.818   8       10      6.22
R1                0        218     200     0       0       100     150     ≤ 100
R2                0        0       36      800     200     0       18      ≤ 100
Simplex           1        1       1       1       1       1       1       = 1
Optimal cost      -        -       26.22   21.28   7.35    5.95    5.95
Mixture weights   0        0.424   0       0       0.500   0.076   0
λc1               1.0e15   0.024   0.010   0.238   0.227   0.217   0.061
λc2               1.0e15   0.025   0.015   0       0.060   0.210   0.041

Table 2.2: Column Generation example with 100 objects. The tableau is displayed in its final form after convergence. λcs describe the lambda trajectories up until convergence. R1 and R2 are resource constraints. γ1 is a ‘do-nothing’ strategy. Bold numbers represent useful solution data.

Figure 2·9: The 3 pure strategies that correspond to columns 2, 5 and 6 of Table 2.2. The frequency of choosing each of these 3 strategies is controlled by the relative proportion of the mixture weight qc ∈ (0, 1) with c ∈ {2, 5, 6}.
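As a quick sanity check on the converged tableau, the following minimal C snippet (an illustrative check, not part of the solver) recombines columns 2, 5 and 6 with the mixture weights from Table 2.2. It reproduces the optimal expected cost of roughly 5.95 and shows that both expected-resource constraints are met essentially with equality, which is the point made below about the mixed strategy and the soft constraints.

    /* Hypothetical verification of the converged mixture in Table 2.2:
     * columns 2, 5 and 6 with weights 0.424, 0.500 and 0.076. */
    #include <stdio.h>

    int main(void) {
        const double q[3]    = {0.424, 0.500, 0.076};   /* mixture weights q2, q5, q6 */
        const double cost[3] = {2.80,  8.0,   10.0};    /* expected cost per column    */
        const double r1[3]   = {218.0, 0.0,   100.0};   /* expected use of sensor 1    */
        const double r2[3]   = {0.0,   200.0, 0.0};     /* expected use of sensor 2    */

        double J = 0.0, R1 = 0.0, R2 = 0.0;
        for (int c = 0; c < 3; ++c) {
            J  += q[c] * cost[c];
            R1 += q[c] * r1[c];
            R2 += q[c] * r2[c];
        }
        /* Prints roughly: E[J] = 5.947, E[R1] = 100.0, E[R2] = 100.0 */
        printf("E[J] = %.3f, E[R1] = %.1f, E[R2] = %.1f\n", J, R1, R2);
        return 0;
    }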
The red and green nodes denote the final decision, vi, for a location. Note that the strategy of column 5 uses only the second sensor, whereas the strategies of columns 2 and 6 use only the first sensor. The mixed strategy allows the soft resource constraints to be satisfied with equality. Table 2.2 also shows the resource costs and expected classification performance of each column.

The example illustrates some of the issues associated with the use of soft constraints in the optimization: the resulting solution does not lead to SM strategies that will always satisfy the hard constraints Eq. 2.2. The Column Generation algorithm generates approximate SM solutions that can be infeasible in two different respects. First, when improbable observations happen, the resource usage implied by the optimal choice of action under the new information can deviate significantly from the expected usage; surprising observations can create unexpected surpluses or shortages in sensor resource budgets, and the effect of these variations needs to be mitigated. The second way in which the Column Generation solution is approximate is w.r.t. the mixture weights used to randomize between the various strategies in the solution set. It is possible that the distribution (probability mass function) that determines the relative frequency of utilization of each pure strategy will be degenerate; this tends to happen towards the end of a simulation when resources are running low. If the governing distribution is degenerate and only one pure strategy has support, this pure strategy gives the optimal solution to the SM problem (though still just w.r.t. expected resource usage). If multiple pure strategies have support, there will frequently be prescribed courses of action, as determined by the operative LP relaxation, that choose to do e.g. 10% of an action that costs 10 units and 50% of an action that costs 2 units (and 40% of the time to do nothing) when there are 2 units of resources left. When there are large numbers of resources and actions yet to be done, this type of approximation works out well, but when there are just a few resources remaining, such sensing plans can prescribe actions that cannot actually be executed.
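A minimal C illustration of that last point, with the illustrative numbers used above rather than data from the experiments: the randomized plan fits the 2-unit budget in expectation, yet one draw in ten selects an action the remaining budget cannot pay for.

    /* Expected vs. realized resource use for the randomized plan above. */
    #include <stdio.h>

    int main(void) {
        const double prob[3] = {0.10, 0.50, 0.40};   /* LP-prescribed frequencies    */
        const double cost[3] = {10.0, 2.0,  0.0};    /* resource cost of each action */
        const double budget  = 2.0;                  /* resources actually remaining */

        double expected = 0.0, p_infeasible = 0.0;
        for (int a = 0; a < 3; ++a) {
            expected += prob[a] * cost[a];
            if (cost[a] > budget) p_infeasible += prob[a];
        }
        printf("expected use = %.1f (fits budget %.1f), P(infeasible draw) = %.2f\n",
               expected, budget, p_infeasible);
        return 0;
    }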
We address these approximation errors and how to handle the termination issues in the next chapter.
Chapter 3

Receding Horizon Control with Approximate, Mixed Strategies

In the previous chapter we discussed the theoretical basis for Receding Horizon (RH) control based on mixtures of pure strategies that near-optimally solve an approximate version of the SM problem given by Eq. 2.5 with constraints Eq. 2.2 and Eq. 2.4. As previously alluded to in Section 2.2, there are several difficulties that arise in making use of these mixed strategies. The development of an RH control algorithm that can make the most of these approximate, mixed strategies without unduly sacrificing performance (or, seen from the dual perspective, wasting resources) represents one of the main contributions of this dissertation. This chapter develops a set of RH (aka Model Predictive Control (MPC) or Open-Loop Feedback Control (OLFC)) algorithms that use replanning to deal with the approximate nature of these mixed strategy solutions. We explore three different alternatives for generating pure strategies from a set of mixed strategies and explore the costs and benefits of each of these methods numerically. In addition to experimenting with these different methods for using the mixed strategies, we explore various possible parameter configurations for the planning horizon, resource levels, FA to MD ratios, sensor homogeneity versus heterogeneity and sensor visibility conditions (using fractional factorial design of experiments).
3.1 Receding Horizon Control Algorithm

The Column Generation algorithm described in Eq. 2.11 - Eq. 2.14 solves the approximate SM problem Eq. 2.9 with “soft” constraints in terms of mixed strategies that, on average, satisfy the resource constraints. However, for control purposes, one must select actual SM actions that satisfy the hard constraints Eq. 2.2 and Eq. 2.4. Another issue is that the solutions of the decoupled POMDPs provide individual sensor schedules for each location that must be interleaved into a single coherent sensor schedule on a common time-line. Furthermore, exact solution of the small, decoupled POMDPs for each set of prices can be time-consuming, making the resulting algorithm unsuitable for real-time SM. To address this issue, we explore a family of RH algorithms that convert the mixed strategy solutions discussed in the previous chapter to actions that satisfy the hard constraints, and limit the computational complexity of the resulting algorithm.

These algorithms for RH control start at stage t with an information state/resource state pair, consisting of available information about each location i = 1, . . . , N represented by the conditional probability vector πi(t) and available sensor resources Rs(t), s = 1, . . . , S. The first step in the RH algorithms is to solve the SM problem of Eq. 2.5 starting at stage t to final stage T subject to the soft constraints Eq. 2.8, using the hierarchical Column Generation/POMDP algorithms to obtain a set of mixed strategies. We introduce a parameter corresponding to the maximum number of sensing actions per location to control the resulting computational complexity of the POMDP subproblem solutions.

The second step is to select sensing actions to implement at the current stage t from the mixed strategies. These strategies are mixtures of at most S + 1 pure strategies, with associated probabilistic weights. We explore three approaches for selecting sensing actions:
• str1: Select the pure strategy with maximum probability.

• str2: Randomly select a pure strategy per location according to the optimal mixture probabilities.

• str3: Select the pure strategy with positive probability that minimizes the expected sensor resource use over all sensors (and leaves resources for use in future stages).

The column gen simulator that we have developed also supports two other methods for converting mixed strategies to sensing actions:

• str4: Select the pure strategy that minimizes classification cost.

• str5: Randomly select a single pure strategy for all locations jointly according to the optimal mixture probabilities.

However, these latter two methods for RH control were deemed to be less useful and therefore were not included in the fractional factorial design of experiments analysis. The pure strategies that are selected for each location map the current information sets, Ii(t) for location i, into a deterministic sensing action. ‘str1’ and ‘str3’ choose the same pure strategy to use across all locations, whereas ‘str2’ chooses a pure strategy on a location-by-location basis.

Note that there may not be enough sensor resources to execute the selected actions, particularly in the case where the pure strategy with maximum probability is selected. To address this, we rank sensing actions by their expected entropy gain [Kastella, 1996]:

Gain(us(t)) = ( H(πi(t)) − Ey[H(πi(t + 1)) | y, us(t)] ) / rs(us(t))     (3.1)

where Ey[·] is the expected future entropy value. We schedule sensor actions in order of decreasing expected entropy gain, and perform those actions at stage t that have enough sensor resources to be feasible.
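A minimal C sketch of the entropy-gain computation in Eq. 3.1 follows (illustrative only; the state and observation dimensions and the likelihood array are placeholders). It returns the expected entropy reduction of one candidate action divided by that action's resource cost, so that candidate (location, mode) pairs can be sorted by decreasing gain before being scheduled.

    /* Entropy gain of a candidate action, as in Eq. 3.1:
     *   gain = ( H(pi) - E_y[ H(pi_y) ] ) / r
     * where pi_y is the Bayes-updated belief after observing y. */
    #include <math.h>

    #define NS 4   /* states       */
    #define NO 3   /* observations */

    static double entropy(const double p[NS]) {
        double h = 0.0;
        for (int x = 0; x < NS; ++x)
            if (p[x] > 0.0) h -= p[x] * log(p[x]);
        return h;
    }

    double entropy_gain(const double pi[NS], const double lik[NS][NO], double r) {
        double expected_posterior_entropy = 0.0;

        for (int y = 0; y < NO; ++y) {
            double py = 0.0, post[NS];
            for (int x = 0; x < NS; ++x) py += lik[x][y] * pi[x];
            if (py <= 0.0) continue;
            for (int x = 0; x < NS; ++x) post[x] = lik[x][y] * pi[x] / py;
            expected_posterior_entropy += py * entropy(post);
        }
        return (entropy(pi) - expected_posterior_entropy) / r;
    }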
We also use the Entropy Gain algorithm at the very end of a simulation when resources are nearly depleted and the higher-cost sensor modes are no longer feasible; see Appendix B for more information. (When the horizon is very short, the Entropy Gain algorithm is nearly optimal, so this does not constitute a significant performance limitation in our design.)

The measurements collected from the scheduled actions are used to update the information states πi(t + 1) using Eq. 2.3. The resources used by the actions are eliminated from the available resources to compute Rs(t + 1) using Eq. 2.4. The RH algorithm is then executed from the new information state/resource state condition in iterative fashion until all resources are expended.

3.2 Simulation Results

In order to evaluate the relative performance of the alternative RH algorithms, we performed a set of simulations. In these experiments, there were 100 locations, each of which could be empty or contain an object of one of three types, so the possible states of location i were xi ∈ {0, 1, 2, 3}, where type 1 represents cars, type 2 trucks, and type 3 military vehicles. Sensors can have several modes: a ‘search’ mode, a low-resolution ‘mode1’ and a high-resolution ‘mode2’. The search mode primarily detects the presence of objects; the low-resolution mode can identify cars, but confuses the other two types, whereas the high-resolution mode can separate the three types. Observations are modeled as having three possible values. The search mode consumes 0.25 units of resources, whereas the low-resolution mode consumes 1 unit and the high-resolution mode 5 units, uniformly for each sensor and location. Table 3.1 shows the likelihood functions that were used in the simulations.

Initially, each location has a state with one of two prior probability distributions: πi(0) = [0.10 0.60 0.20 0.10]T, i ∈ [1, . . . , 10] or πi(0) = [0.80 0.12 0.06 0.02]T, i ∈ [11, . . . , 100]. Thus, the first 10 locations are likely to contain objects, whereas the other
90 locations are likely to be empty.

            Search              Low-res             Hi-res
            o1    o2    o3      o1    o2    o3      o1    o2    o3
empty       0.92  0.04  0.04    0.95  0.03  0.02    0.95  0.03  0.02
car         0.08  0.46  0.46    0.05  0.85  0.10    0.02  0.95  0.03
truck       0.08  0.46  0.46    0.05  0.10  0.85    0.02  0.90  0.08
military    0.08  0.46  0.46    0.05  0.10  0.85    0.02  0.03  0.95

Table 3.1: Observation likelihoods for different sensor modes with the observation symbols o1, o2 and o3. Low-res = ‘mode1’ and Hi-res = ‘mode2’.

When multiple sensors are present, they may share some locations in common, and have locations that can only be seen by a specific sensor, as illustrated in Fig. 3·1. In general we consider the resource levels: 300 units (resource-poor scenario), 500 units (normal resources) and 700 units (resource-rich scenario), where these numbers are the cumulative amounts for all sensors involved in a simulation. However, in situations involving search and classification, the available resources per sensor were scaled back. If, for example, a search and classify simulation had 2 sensors involved, then resources were reduced by a factor of 2 × 5 = 10: by 2 because there are 2 sensors, and by 5 because only about 20–25% of the locations are occupied and the unoccupied locations can be ruled out very cheaply. If there were 80 empty locations, then it would be possible to search them all twice for just 40 resource units. About 85% of the likely-to-be-empty locations could be safely ruled out with 2 looks, and all of the more expensive measurements get focused on more ambiguous locations with the remaining 20–100 units (depending on the setup) of resources between both sensors. (Keeping resources calibrated this way makes comparisons possible across simulation parameter sets.)

The cost function used in the experiments, c(xi, vi), is shown in Table 3.2. The parameter MD represents the cost of a missed detection, and is varied in the experiments. The variable “Horizon” is the total number of sensor actions allowed per location plus one additional action for estimating the location content (i.e. making a classification decision).
Figure 3·1: Illustration of a scenario with two partially-overlapping sensors.

xi \ vi     empty   car   truck   military
empty       0       1     1       1
car         1       0     0       1
truck       1       0     0       1
military    MD      MD    MD      0

Table 3.2: Decision costs.

Table 3.3 shows simulation results for a search and classify scenario involving 2 identical sensors (with the same visibility profile), evaluating the 3 alternative versions of the RH control algorithms with 3 resource levels: 30, 50 and 70 units. Table 3.4 displays the accompanying lower bound performance computed for these simulations. The missed detection cost MD is varied from 1 to 5 to 10. (The MD cost affects the expected classification cost as discussed in Fig. 2·4.) The results shown in Table 3.3 represent the average of 100 Monte Carlo simulation runs of the 3 RH algorithms. Graphical versions of this table are provided in Fig. 3·2 - Fig. 3·4. For a horizon 6 plan, the longest horizon studied, the simulation performance is close to that of the associated bound. The results show that the different methods of RH control have performance close to the optimal lower bound in most cases, with the exception being the case of MD = 5 with 70 units of sensing resources (per sensor). For shorter horizons, sometimes the simulation performance is better than the “bound” because the MPC algorithm is allowing more sensor observations per object than the
plan is accounting for, after doing multiple planning iterations. (A simulation may re-plan 5+ times before resources are exhausted, but the bound was computed assuming there will only be a maximum of two sensing opportunities per object for a horizon = 3 problem.) Obviously, the easily classified objects are ruled out of the competition for sensor resources (their identities are decided upon) early in the simulation process, and with every additional planning iteration, sensor resources are concentrated on the remaining objects whose identities are uncertain. In this situation the approximate sensor plan from the Column Generation algorithm does not match well with the way events unfold, and the “bound” is not a bound. However, after choosing a horizon that accounts appropriately for how many RH planning iterations there will be and how many sensing opportunities will take place at each location, the bounds are tight.

In terms of which strategy is preferable for converting the mixed strategies to a pure strategy, the results of Table 3.3 are unclear. For short planning horizons in the RH algorithms, the preferred strategy appears to be to use the least resources (str3): because planning with a longer horizon improves performance minimally, we find that an RH replanning approach with a short horizon, in conjunction with a resource-conservative planning strategy, can be used to reduce computation time with limited performance degradation. For the longer horizons, there was no significant difference in performance among the three strategies we investigated.

In the next set of experiments, we compare the use of heterogeneous sensors that have different modes available. In these experiments, the 100 locations are guaranteed to have an object, so xi = 0 is not feasible. The prior probability of object type for each location is πi(0) = [0 0.7 0.2 0.1]T. Table 3.5 shows the results of experiments with sensors that have all sensing modes, versus an experiment where one sensor has only a low-resolution mode and the other sensor has both high and low-resolution modes. The table shows the lower bounds predicted by the Column Generation algorithm, to illustrate the change in performance expected from the different architectural choices of sensors.
             MD = 1                  MD = 5                  MD = 10
             str1   str2   str3      str1   str2   str3      str1   str2   str3
Hor. 3
  Res 30     3.64   3.85   3.85      11.82  12.88  12.23     15.28  14.57  14.50
  Res 50     2.40   2.80   2.43      6.97   6.93   7.84      10.98  9.99   10.45
  Res 70     2.45   2.32   1.88      3.44   3.99   4.04      6.14   6.48   5.10
Hor. 4
  Res 30     3.58   3.46   3.52      12.28  12.62  11.90     14.48  15.91  15.59
  Res 50     2.37   2.21   2.33      7.44   7.44   7.20      9.94   9.28   10.65
  Res 70     1.68   1.33   1.60      3.59   3.57   3.62      6.30   5.18   5.86
Hor. 6
  Res 30     3.51   3.44   3.73      11.17  11.85  12.09     15.17  14.99  13.6
  Res 50     2.28   2.11   2.31      7.29   8.02   7.70      10.67  10.47  11.25
  Res 70     1.43   1.38   1.44      3.60   3.73   3.84      4.91   5.09   5.94

Table 3.3: Simulation results for 2 homogeneous, multi-modal sensors in a search and classify scenario. str1: select the most likely pure strategy for all locations; str2: randomize the choice of strategy per location according to mixture probabilities; str3: select the strategy that yields the least expected use of resources for all locations. See Fig. 3·2 - Fig. 3·4 for the graphical version of this table.

               MD = 1   MD = 5   MD = 10
Horizon 3
  Res[30, 30]  4.96     12.11    14.64
  Res[50, 50]  4.09     8.79     11.16
  Res[70, 70]  3.38     6.20     8.19
Horizon 4
  Res[30, 30]  4.24     11.86    14.56
  Res[50, 50]  3.09     6.72     9.50
  Res[70, 70]  2.16     4.24     5.94
Horizon 6
  Res[30, 30]  3.35     11.50    13.85
  Res[50, 50]  2.21     6.27     9.40
  Res[70, 70]  1.32     2.95     4.96

Table 3.4: Bounds for the simulation results in Table 3.3. When the horizon is short, the 3 MPC algorithms execute more observations per object than were used to compute the “bound”, and therefore, in this case, the bounds do not match the simulations; otherwise, the bounds are good.
[Figure 3·2: nine bar charts of E[J], one panel per FA:MD ∈ {1, 5, 10} and per total resource level ∈ {60, 100, 140}; each panel lists its lower bound (as in Table 3.4, horizon 3).]

Figure 3·2: This figure is the graphical version of Table 3.3 for horizon 3. Simulation results for two sensors with full visibility and detection (X = {’empty’, ’car’, ’truck’, ’military’}) using πi(0) = [0.1 0.6 0.2 0.1]T ∀ i ∈ [0..9], πi(0) = [0.80 0.12 0.06 0.02]T ∀ i ∈ [10..99]. There is one bar in each sub-graph for each of the three simulation modes studied in this chapter. The theoretical lower bound can be seen in the upper-right corner of each bar-chart.

[Figure 3·3: the same panel layout for horizon 4.]

Figure 3·3: This figure is the graphical version of Table 3.3 for horizon 4.
[Figure 3·4: the same panel layout for horizon 6.]

Figure 3·4: This figure is the graphical version of Table 3.3 for horizon 6.

The results indicate that specialization of one sensor can lead to significant degradation in performance due to inefficient use of its resources.

The next set of results explores the effect of spatial distribution of sensors. We consider experiments where there are two homogeneous sensors which have only partially-overlapping coverage zones. (We define a “visibility group” as a set of sensors that have a common coverage zone.) Table 3.6 gives bounds for different percentages of overlap. Note that, even when there is only 20% overlap, the achievable performance is similar to that of the 100% overlap case in Table 3.5, indicating that proper choice of strategies can lead to efficient sharing of resources from different sensors by equalizing their workload.

The last set of simulation results we consider shows the performance of these RH algorithms for three homogeneous sensors with partial sensor overlap, no detection and varying resource levels. The visibility groups are graphically portrayed in Fig. 3·5. Table 3.7 presents the simulated cost values averaged over 100 simulations of the three different RH algorithms. See Table 3.8 for the accompanying lower bounds. Fig. 3·6 - Fig. 3·8 display the graphical version of these tables. The results support our previous conclusions: when a short horizon is used in one of the RH algorithms, and there are sufficient resources, the pure strategy that uses the least resources is preferred, as it allows for replanning when new information is available.
                   Homogeneous                   Heterogeneous
                   MD = 1    MD = 5    MD = 10   MD = 1   MD = 5    MD = 10
Horizon 3
  Res[150, 150]    5.689     16.928    30.380    6.338    18.150    31.233
  Res[250, 250]    4.614     16.114    25.917    5.527    16.767    29.322
  Res[350, 350]    4.225     15.301    21.453    5.123    16.414    27.411
Horizon 4
  Res[150, 150]    5.016     16.059    20.606    5.641    16.849    20.606
  Res[250, 250]    3.939     9.461     12.662    4.576    12.047    14.873
  Res[350, 350]    3.352     8.578     12.474    4.275    9.407     12.651
Horizon 6
  Res[150, 150]    4.618     15.661    19.564    5.271    16.202    19.564
  Res[250, 250]    2.919     8.237     10.913    3.321    8.830     11.347
  Res[350, 350]    2.175     4.860     7.151     2.658    6.629     9.174

Table 3.5: Comparison of lower bounds for 2 homogeneous, bi-modal sensors (left 3 columns) versus 2 heterogeneous sensors in which S1 has only ‘mode1’ available but S2 supports both ‘mode1’ and ‘mode2’ (right 3 columns). There is 1 visibility-group with πi(0) = [0.7 0.2 0.1]T ∀ i ∈ [0..99]. For many of the cases studied there is a performance hit of 10–20%.

                   Overlap 60%                   Overlap 20%
                   MD = 1    MD = 5    MD = 10   MD = 1   MD = 5    MD = 10
Horizon 3
  Res[150, 150]    5.69      16.93     30.38     5.69     16.93     30.38
  Res[250, 250]    4.61      16.11     25.98     4.61     16.11     25.92
  Res[350, 350]    4.23      15.30     21.45     4.23     15.30     21.45
Horizon 4
  Res[150, 150]    5.02      16.06     20.61     5.02     15.93     20.61
  Res[250, 250]    3.94      9.46      12.66     3.94     9.46      12.66
  Res[350, 350]    3.35      8.58      12.47     3.35     8.58      12.47
Horizon 6
  Res[150, 150]    4.62      15.66     19.56     4.62     15.66     19.56
  Res[250, 250]    2.92      8.25      10.91     2.94     8.24      10.91
  Res[350, 350]    2.18      4.86      7.19      2.18     4.86      7.16

Table 3.6: Comparison of sensor overlap bounds with 2 homogeneous, bi-modal sensors and 3 visibility-groups. Both configurations use the prior πi(0) = [0.7 0.2 0.1]T. Compare and contrast with the left half of Table 3.5: most of the time the two sensors have enough objects in view to be able to efficiently use their resources for both the 60% and 20% overlap configurations; only the bold numbers are different.
Figure 3·5: The 7 visibility groups for the 3-sensor experiment, indicating the number of locations in each group.

If the RH algorithm uses a longer horizon, then its performance approaches the theoretical lower bound, and the difference in performance between the three approaches for sampling the mixed strategy to obtain a pure strategy is statistically insignificant.

To illustrate the computational requirements of this scenario (4 states, 3 observations, 2 sensors (6 actions), full sensor-overlap), the number of columns generated by the Column Generation algorithm to compute a set of mixed strategies was on the order of 10–20 columns for the horizon 5 algorithms, which takes about 60 sec on a 2.2 GHz, single-core, Intel P4 machine under Linux using C code in “Debug” mode (with 1000 belief-points for PBVI). Memory usage without optimizations is around 3 MB. There are typically 4–5 planning sessions in a simulation before resources are exhausted. Profiling indicates that roughly 80% of the computing time goes towards Value Backups in the PBVI routine and 15% goes towards tracing decision-trees in order to back out (deduce) the measurement costs from hyperplane costs (see Section 2.2.3). A set of simulations with 81 parameter combinations, 100 Monte Carlo runs for each combination, 4 states, 3 observations, 2 sensors (6 actions) and 7 visibility groups as in Fig. 3·5 required 5 days to compute and entailed solving on the order of 2 million POMDPs.
             MD = 1                  MD = 5                  MD = 10
             str1   str2   str3      str1   str2   str3      str1   str2   str3
Hor. 3
  Res 100    5.26   6.08   5.57      17.23  17.44  16.79     22.02  21.93  22.16
  Res 166    5.91   4.81   3.13      10.23  11.91  9.21      14.19  16.66  12.85
  Res 233    3.30   3.75   3.43      10.15  9.32   5.88      14.49  12.55  8.21
Hor. 4
  Res 100    5.32   5.58   5.93      17.26  16.88  16.17     21.92  20.94  21.35
  Res 166    3.42   4.07   3.24      8.63   8.00   9.04      12.05  11.71  14.08
  Res 233    3.65   3.07   3.29      5.27   7.14   5.38      8.25   10.08  7.90
Hor. 6
  Res 100    5.79   5.51   5.98      17.13  17.90  17.44     22.03  20.56  22.17
  Res 166    2.96   2.68   2.72      10.22  8.33   9.08      9.82   11.47  11.57
  Res 233    1.52   2.00   1.70      4.81   4.13   4.24      5.64   7.20   5.11

Table 3.7: Simulation results for 3 homogeneous sensors without using detection but with partial overlap as shown in Fig. 3·5. See Fig. 3·6 - Fig. 3·8 for the graphical version.

               MD = 1   MD = 5   MD = 10
Horizon 3
  Res[30, 30]  5.69     16.93    30.38
  Res[50, 50]  4.61     16.11    25.89
  Res[70, 70]  4.26     15.31    21.48
Horizon 4
  Res[30, 30]  5.02     15.92    20.61
  Res[50, 50]  3.94     9.46     12.66
  Res[70, 70]  3.35     8.58     12.48
Horizon 6
  Res[30, 30]  4.62     15.66    19.56
  Res[50, 50]  2.92     8.22     10.89
  Res[70, 70]  2.18     4.87     7.18

Table 3.8: Bounds for the simulation results in Table 3.7. When the horizon is short, the 3 MPC algorithms execute more observations per object than were used to compute the bound, and therefore, in this case, the bounds do not match the simulations; otherwise, the bounds are good.
Figure 3·6: This figure is the graphical version of Table 3.7 for horizon 3. Situation with no detection but limited visibility (X = {’car’, ’truck’, ’military’}) using πi(0) = [0.70 0.20 0.10]T ∀ i ∈ [0..99]. There were 7 visibility-groups: 20x001, 20x010, 20x100, 12x011, 12x101, 12x110, 4x111. The 3 bars in each sub-graph are for ‘str1’, ‘str2’, ‘str3’ respectively. The theoretical lower bound can be seen in the upper-right corner of each bar-chart.

Figure 3·7: This figure is the graphical version of Table 3.7 for horizon 4.
Figure 3·8: This figure is the graphical version of Table 3.7 for horizon 6.

(As a reality check, the number of seconds in 5 days is 432,000, and 432,000 / 2.0e6 ≈ 0.216 sec per POMDP, which makes sense.) The problem with handling sensor visibility in this way is that we are again in the land of combinatorics, which does not scale well for large numbers of sensors. If there are 3 sensors, then in general there are 2^3 − 1 possible combinations for how sensors can cover an area (every combination except the trivial “no sensor has visibility” combination is considered), and POMDP subproblems must be solved for each combination. However, in real-world scenarios, sensors are not likely to all be in one geographic location, and there will not be the need to solve POMDP subproblems for every possible combination of sensors with visibility in a given region.

In terms of the computational complexity of our RH algorithms, the main bottleneck is obviously the solution of the POMDP problems. The LPs solved in the column generation approach are small and are solved in minimal time. Solving the POMDPs
required to generate each column (one POMDP for each visibility group in cases with partial sensor overlap) is tractable by virtue of the hierarchical breakdown of the SM problem into independent subproblems. It is also possible to parallelize the POMDP computations, and even the columns generated in Column Generation, using multi-core CPU or GPU processors.

Our results suggest that RH control with modest horizons of 2 or 3 sensor actions per location can yield performance close to the best achievable performance using mixed strategies that are resource-conservative. This result is all the more significant because the number of hyperplanes that support the optimal value function grows super-exponentially with the planning horizon (and PBVI does not necessarily capture all of these hyperplanes); a shorter horizon is a huge computational savings. If shorter horizons are used to reduce computation, then an approach that samples mixed strategies by using the smallest amount of resources (while still using resources) is preferred. These results also show that, with proper SM, geographically distributed sensors with limited visibility can be coordinated to achieve performance equivalent to centrally pooled resources.
Chapter 4

Adaptive SM with State Dynamics

This chapter considers several extensions to the basic SM problem in Section 2.2. We have thus far primarily concerned ourselves with problems involving stationary states at the locations being investigated and sensor platforms that can observe locations in any order they choose. The stationary state assumption is a significant limitation, and so is the assumption that locations are always visible to automata. To develop a more realistic algorithm, we consider two alternative extensions that generalize the baseline model of Section 2.2 to handle either per-location state dynamics or known per-location visibility dynamics. While we do not attempt to include all of these additional model features in one unified algorithm, this chapter provides the basis for subsequent research in this domain. Much of the formulation in this chapter is similar but not identical to Section 2.2. Slight differences exist concerning indexing and variable definitions (e.g. xi versus xi(t), t ≤ T − 1 versus t ≤ T, etc.), so we reproduce relevant parts of the formulation here instead of referring to Section 2.2.

4.1 Time-varying States Per Location

The first extension we consider is a relaxation of the assumption that each location has a static state. As previewed with Fig. 2·7, Fig. 2·8 and Fig. 2·6, in this section we assume that the state at each location has Markov dynamics. We use a Markov Birth-Death Process, generalized to handle a discrete set of possible “live” states per location, to
model the potential for new objects to show up unexpectedly or for previously known objects to disappear without warning. Since the per-location states are unobservable, each of the N locations will have an associated Hidden Markov Model (HMM) that describes the state dynamics for that location. Our decomposition approach requires that a common set of actions (sensor modes that consume resources) be used across locations, but otherwise this formulation allows the HMM (states, transition probabilities, observation probabilities) to vary across locations. Every distinct HMM will necessitate its own POMDP subproblem solution, so the flexibility comes with a loss of solution generality.

Assume there are a finite number of locations 1, . . . , N, each of which may at a particular time have an object of a given type or may be empty. Let there be S sensors, each of which has multiple sensor modes indexed as m = 1, . . . , Ms, and assume that each sensor can observe a set of locations at each discrete time-instant (stage) with a mode selected per location. Let xi(t) ∈ {0, 1, . . . , D} denote the state of location i at time t, where xi(t) = 0 if location i is unoccupied, and otherwise xi(t) = k > 0 indicates location i contains an object of type k at time t. Let πi(0) ∈ ℜ^(D+1) be a discrete a priori probability distribution over the possible states for the ith location for i = 1, . . . , N where D ≥ 2. Assume additionally that the random variables xi(t) for i = 1, . . . , N are mutually independent for each time t.

Let the state of each of the N locations be governed by an independent Markov chain such as the example in Fig. 4·1. In our model the transition probabilities are specified as a stochastic matrix {pjk} of dimension (D + 1) × (D + 1) that has as elements the (stationary) probabilities pjk = P(xi(t + 1) = j|xi(t) = k). We use these transition probabilities to give locations an arrival probability pa,i of transitioning from an ‘empty’ state to a non-empty state and a departure probability pd,i of transitioning from a non-empty state to an ‘empty’ state. These probabilities may depend on the initial state for departures or final state for arrivals.
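As a minimal illustration (not the dissertation's code), the C sketch below builds such a (D + 1) × (D + 1) transition matrix from an arrival probability and a departure probability for a single location and applies the one-step belief prediction π(t + 1) = P π(t). Sending all of the arrival mass to the ‘military’ type is an assumption carried over from the Ch. 2 example, and the numerical values are illustrative.

    /* One-location HMM with D = 3 object types (0 = empty, 1 = car,
     * 2 = truck, 3 = military): transition matrix from arrival/departure
     * probabilities and a one-step belief prediction with no measurement. */
    #include <stdio.h>

    #define NS 4   /* states: 0 = empty, 1..3 = object types */

    int main(void) {
        const double pa = 0.05;   /* arrival probability per stage            */
        const double pd = 0.00;   /* departure probability (0 = arrivals only) */
        const double arrival_split[NS] = {0.0, 0.0, 0.0, 1.0};  /* all to 'military' */

        /* P[j][k] = P(x(t+1) = j | x(t) = k), matching the {p_jk} convention. */
        double P[NS][NS] = {{0.0}};
        P[0][0] = 1.0 - pa;
        for (int j = 1; j < NS; ++j) P[j][0] = pa * arrival_split[j];
        for (int k = 1; k < NS; ++k) {
            P[0][k] = pd;             /* departure back to 'empty'    */
            P[k][k] = 1.0 - pd;       /* otherwise the object remains */
        }

        /* One-step prediction of the belief from the Ch. 3 prior. */
        const double pi[NS] = {0.80, 0.12, 0.06, 0.02};
        double pred[NS] = {0.0};
        for (int j = 0; j < NS; ++j)
            for (int k = 0; k < NS; ++k)
                pred[j] += P[j][k] * pi[k];

        /* Prints 0.7600 0.1200 0.0600 0.0600: the empty mass drops from 80% to
         * 76% in one stage and the arrival mass shows up on 'military'. */
        for (int j = 0; j < NS; ++j) printf("%.4f ", pred[j]);
        printf("\n");
        return 0;
    }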
Figure 4·1: An example HMM that can be used for each of the N locations. pa is an arrival probability and pd is a departure probability for the Markov chain.

There are s = 1, . . . , S sensors, each of which has m = 1, . . . , Ms possible modes of observation. Let there be a series of T discrete decision stages with t = 1, . . . , T for sensors to make measurements. Each sensor s has a limited set of locations that it can observe at each stage, denoted by Os(t) ⊆ {1, . . . , N}. At each stage, each sensor can choose to employ one of its sensor modes to collect noisy measurements concerning the states xi(t) of the sensed locations in its Field of View (FOV) (location i is in the FOV of sensor s if i ∈ Os(t)).

To define the cost, we assume that a tentative decision is made concerning the identity (state) of each location at the end of each stage. We assume the following model of causality: at time t, a subset of the N locations is sensed using various sensor modes, statistics (beliefs) concerning the states of these locations are updated, and (still at time t) a classification decision (aka “declaration”) is made about the state of each location. The process repeats for each successive time-step until T stages of
time have elapsed. The action space for each stage is then the Cartesian product of the set of feasible sensor-mode assignments for the N locations with the set of tentative classification decisions for the N locations. A sensor action by sensor s at stage t is the set of pairs:

us(t) = {(is(t), ms(t)) | is(t) ∈ Os(t), ms(t) ∈ Ms}     (4.1)

where each pair consists of a location to observe, is(t), and a sensor mode (independent for each location) used to observe this location, ms(t), where the mode is restricted to the set of feasible modes given the resource levels for each sensor. We assume that no two sensors observe the same location at the same time in order to minimize the complexity of the associated action and observation spaces. Let ui,s(t) refer to the sensor action taken on location i with sensor s at stage t if any, or let ui,s(t) = ∅ otherwise.

Sensor measurements are modeled as belonging to a finite set ys,m ∈ {1, . . . , Ls}. The likelihood of the measured value is assumed to depend on the sensor s, sensor mode m, location i and on the true state at the location, xi(t), but not on the states of other locations (statistical independence). Denote this likelihood as P(ys,m(t)|xi(t), i, s, m). Thus, the Markov models are observed through noisy measurements, resulting in HMMs. We assume that this likelihood given xi(t) is time-invariant, and that the random measurements ys,m(t) are conditionally independent of other measurements yσ,n(τ) given the states xi(t), xj(τ) for all sensors s, σ and modes m, n provided i ≠ j or τ ≠ t.

Assume each sensor has a quantity Rs of resources available for measurements during each stage, so there is a periodic constraint on sensor utilization. Associated with the use of mode m by sensor s on location i at time t is a resource cost rs(ui,s(t)) to use this mode, representing power or some other type of resource required to operate the sensor:

Σ_{i∈Os(t)} rs(ui,s(t)) ≤ Rs   ∀ s ∈ [1 . . . S]; ∀ t ∈ [1 . . . T]     (4.2)
This is a hard constraint for each realization of observations and decisions. Let I(t) denote the sequence of past sensing actions and measurement outcomes up to and including stage t − 1:

I(t) = {(ui,s(τ), ys,m(τ)) | i ∈ Os(τ); s = 1, . . . , S; τ = 1, . . . , t − 1}

Define I(0) as the prior probability π(0) = p(x(0)) = ∏_{i=1}^{N} p(xi(0)). Under the assumption of conditional independence of measurements and independence of the Markov chains governing each location, the joint probability π(t) = P(x1(t) = k1, x2(t) = k2, . . . , xN(t) = kN |I(t)) can be factored as the product of belief-states (marginal conditional probabilities) for each location. Denote the belief-state at location i as πi(t) = p(xi(t)|I(t)). The belief-state πi(t) is a sufficient statistic to capture all information that is known about location i at time t.

When a sensor measurement is taken, the belief-state is updated according to Bayes’ Rule. A measurement of location i with the sensor-mode combination ui,s(t) = (i, m) at stage t that generates observable ys,m(t) updates the belief-vector as:

πi(t + 1) = diag{P(ys,m(t)|xi(t) = j, i, s, m)} πi(t) / ( 1^T diag{P(ys,m(t)|xi(t) = j, i, s, m)} πi(t) )     (4.3)

where 1 is the D + 1 dimensional vector of all ones. Eq. 4.3 captures the relevant information dynamics that SM controls with our HMM state formulation.

Given the information I(t) at stage t, the quality of the information collected is measured by making an estimate of the state xi(t) of each location i given the available information (the information history of observations and actions and the initial probability vector π(0)). Denote these estimates as vi(t) ∀ i = 1, . . . , N. The Bayes cost of selecting estimate vi(t) when the true state is xi(t) is denoted as c(xi(t), vi(t)) ∈ ℜ with c(xi(t), vi(t)) ≥ 0. Typically we assume c(xi(t), vi(t)) to be a 0–1 symmetric cost matrix, or else a matrix with 0 cost along the diagonal and FA and MD cost terms off
the diagonal in the appropriate locations (relative to which state is not to be missed and which state is a nuisance-detection). The objective of this problem is to estimate the state of each location at each time with minimum error as measured by the number of FAs and MDs:

J = min_{γ∈Γ} E^γ [ Σ_{i=1}^{N} Σ_{t=1}^{T} c(xi(t), vi(t)) ]     (4.4)

subject to Eq. 4.2. The minimization is done over the (countable) space of admissible, adaptive feedback strategies γ ∈ Γ. In this context, a strategy γ is a time-varying mapping from an information set (history) to a sensor action and tentative classification decision.

After replacing the hard resource constraint in Eq. 4.2 with an expected-resource-use constraint for each of the S sensors, we have the constraints:

Σ_{i∈Os(t)} E[rs(ui,s(t))] ≤ Rs   ∀ s ∈ [1 . . . S]; ∀ t ∈ [1 . . . T]     (4.5)

We can dualize these resource constraints and create an augmented objective-function (Lagrangian) of the form:

Jλ = min_{γ∈Γ} E^γ [ Σ_{i=1}^{N} Σ_{t=1}^{T} c(xi(t), vi(t)) − Σ_{s=1}^{S} Σ_{t=1}^{T} λs(t) ( Rs − Σ_{i∈Os(t)} rs(ui,s(t)) ) ]     (4.6)

This problem is a lower bound on the original problem Eq. 4.4 with sample path constraints Eq. 4.2 because every strategy that satisfies the original sample path constrained problem is feasible for the relaxed problem.

In order to proceed with this derivation, we need a theoretical justification for why choosing actions for locations on an individual basis will not detrimentally affect the optimal cost on a global basis now that locations have time-varying state. The work of [Castañón, 2005a] provides such a result for a stationary-state case; we need to generalize
this theory for state dynamics. The idea that classification performance should not suffer seems rather intuitive considering that the states of each location are statistically independent; however, there is still the coupling mechanism governed by resource usage. Consider the following lemma (which uses some terms, e.g. “local strategies”, defined in Section 2.2.1):

Lemma 4.1.1 (Optimality of Local Adaptive Feedback Strategies). Given an SM problem with periodic resource constraints, an independent HMM governing the state of each location i ∀ i ∈ [1, . . . , N], and the Lagrange multiplier trajectories λs(t) ∀ s, t, the performance of an optimal, non-local, adaptive feedback strategy γ is equal to the performance of N locally optimal, adaptive feedback strategies γi.

Proof. We have the following inequality:

min_{γ∈Γ} E^γ [ Σ_{i=1}^{N} Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) ) ] ≥ Σ_{i=1}^{N} min_{γ∈Γ} E^γ [ Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) ) ]     (4.7)

because on the right-hand side the minimum for each term in the sum can use a different strategy, whereas on the left-hand side the same strategy must be used for all N terms. Now, consider the minimization problem for each location i:

min E [ Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) ) ]

We can solve this problem via SDP. We break the decision problem at each stage t into two stages: first, we select ui,s(t) and collect information on object i. Then, we select vi(t), the tentative classification. At the final stage, consider the selection of vi(T) as a function of the complete information state I(T), collected over the entire set of locations,
as:

J∗i(I(T), T) = min_{vi(T)} E[c(xi(T), vi(T)) | I(T)] = min_{vi(T)} E[c(xi(T), vi(T)) | Ii(T)] ≡ J∗i(Ii(T), T)

because of the independence of xi(T) from other xj(T) and conditional independence of the observations of location i from those of other locations. These independence assumptions imply p(xi(T)|I(T)) = p(xi(T)|Ii(T)). Thus, the optimal decision, vi(T), and the optimal cost-to-go will be a function only of Ii(T) and not all of I(T).

Now, assume inductively that for stages τ > t, the optimal cost-to-go J∗i(I(τ), τ) ≡ J∗i(Ii(τ), τ) depends only on the information collected at location i, and the strategy for the optimal decision vi(τ) and measurements ui,s(τ + 1) for s = 1, . . . , S depends only on Ii(τ) and not all of I(τ). Consider the minimization over the choice of vi(t), ui,s(t + 1), s = 1, . . . , S. Under γ, these are functions of the entire information state I(t). Bellman’s equation becomes:

J∗i(I(t), t) = min_{vi(t), ui,s(t+1)} E[ c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t + 1) rs(ui,s(t + 1)) + E[ J∗i(Ii(t + 1), t + 1) | I(t), {ui,s(t + 1)} ] | I(t) ]
            = min_{vi(t), ui,s(t+1)} E[ c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t + 1) rs(ui,s(t + 1)) + E[ J∗i(Ii(t + 1), t + 1) | Ii(t), {ui,s(t + 1)} ] | Ii(t) ]

because of the same independence assumptions, which imply that p(xi(t)|I(t)) = p(xi(t)|Ii(t)) and:

E[ J∗i(Ii(t + 1), t + 1) | I(t), {ui,s(t + 1)} ] = E[ J∗i(Ii(t + 1), t + 1) | Ii(t), {ui,s(t + 1)} ]
Hence, by induction through DP, we have shown:

min_{γ∈Γ} E^γ [ Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) ) ] = min_{γi∈ΓL} E^{γi} [ Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) ) ]

Thus:

min_{γ∈Γ} E^γ [ Σ_{i=1}^{N} Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) ) ] ≥ Σ_{i=1}^{N} min_{γi∈ΓL} E^{γi} [ Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) ) ]     (4.8)

where ΓL is the set of admissible, local, adaptive feedback strategies, and γi ∈ ΓL maps Ii(t) to (vi(t), {ui,s(t + 1)}). To complete the proof, note that feedback strategies of the form γ = (γ1, γ2, . . . , γN) are admissible strategies for the optimization problem on the left. Hence, the optimal local strategies γi achieve equality in the above equation, establishing the lemma.

The cost function can be decoupled into the sum of N local cost functions as follows:

Jλ = min_{γ∈Γ} E^γ [ Σ_{i=1}^{N} Σ_{t=1}^{T} c(xi(t), vi(t)) + Σ_{s=1}^{S} Σ_{t=1}^{T} Σ_{i∈Os(t)} λs(t) rs(ui,s(t)) − Σ_{s=1}^{S} Σ_{t=1}^{T} λs(t) Rs ]

Jλ = min_{γ∈Γ} E^γ [ Σ_{i=1}^{N} Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) I(i ∈ Os(t)) ) ] − Σ_{t=1}^{T} Σ_{s=1}^{S} λs(t) Rs
where I(·) is an indicator function (meaning is context-sensitive). Now using Lemma 4.1.1 we have:

Jλ = Σ_{i=1}^{N} min_{γi∈ΓL} E^{γi} [ Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) I(i ∈ Os(t)) ) ] − Σ_{t=1}^{T} Σ_{s=1}^{S} λs(t) Rs     (4.9)

To develop a new lower bound for this formulation with HMM subproblems, we write Eq. 4.9 in terms of mixed strategies as we did in Section 2.2.1:

Jλ = Σ_{i=1}^{N} min_{γi∈Q(ΓL)} E^{γi} [ Σ_{t=1}^{T} ( c(xi(t), vi(t)) + Σ_{s=1}^{S} λs(t) rs(ui,s(t)) I(i ∈ Os(t)) ) ] − Σ_{t=1}^{T} Σ_{s=1}^{S} λs(t) Rs     (4.10)

where Q(ΓL) is the set of all mixtures of the pure strategies in the set ΓL. Because we have used a relaxed form of the resource constraint Eq. 4.2, we know that the actual optimal cost in Eq. 4.4 must be lower bounded by Eq. 4.10: we have expanded the space of feasible actions. This identification leads to the inequality:

J∗ ≥ sup_{λ1,...,λS ≥ 0} Jλ1,...,λS     (4.11)

from weak duality in Linear Programming. Eq. 4.11 is the dual of the LP:

min_{q∈Q(ΓL)} Σ_{γi∈ΓL} q(γi) E^{γi} [ Σ_{i=1}^{N} Σ_{t=1}^{T} c(xi(t), vi(t)) ]     (4.12)

Σ_{γi∈ΓL} q(γi) E^{γi} [ Σ_{i∈Os(t)} rs(ui,s(t)) ] ≤ Rs   ∀ s ∈ [1, . . . , S], and t ∈ [1, . . . , T]     (4.13)

Σ_{γi∈ΓL} q(γi) = 1     (4.14)
where we have one constraint for each of the S sensor resource pools for each time t, and an additional simplex constraint in Eq. 4.14 which ensures that q ∈ Q(ΓL) forms a valid probability distribution.

With this decomposition and our new lower bound in place, we are able to formulate individual POMDPs for each of the subproblems according to the location-specific dynamics we want to model, provided that we do not violate our statistical independence assumptions. We can use the same Column Generation techniques that were presented in Section 2.2, but this time with an expanded set of constraints: we need to use one constraint per sensor per time step. The extra constraints imply that the optimal basis determined via Lagrangian Relaxation will now have ST + 1 variables, though different variables (pure strategies) will have support at different times. In consequence, we have randomized strategies that mix not only in terms of which sensor is utilized, but when a sensor is utilized. At any given stage, most of these mixture coefficients will be 0, and just a few pure strategies will be employed. The pure strategies that have support will be a time-varying set determined by which sensors can see which objects at various points over the planning horizon.

We have already implemented algorithms that handle a Markov Birth Process at each location, and the preceding derivation provides the theoretical justification for modeling not just object arrivals, but object departures. Our existing algorithm, with suitable modifications, can be applied to solve this problem with reasonable computational time. The open question is how the number of generated columns scales with the number of constraints in the Column Generation routine; this is a matter for future research.

4.2 Time-varying Visibility

As a surrogate for “locality constraints” on sensing operations, in Ch. 3 we divided locations into sets of groups, termed “visibility groups”, and gave sensors a 0–1 type
(static) constraint for which locations they could observe. In this section we consider a model in which sensor-location visibility is time-varying but known and deterministic.

Consider a sensor management and object classification problem in which there are a finite number of locations 1, . . . , N, each of which may have an object of a given type or may be empty. In this formulation we assume that locations do not change their class-affiliation, but may change visibility over time. The visibility of locations is taken to be a known, boolean sequence of indicators that determine when locations can be observed on a location-by-location basis; each location has its own visibility trajectory. This model would be appropriate, for example, for a satellite that is scheduled to pass over an area.

Let there be S sensors, each of which has multiple sensor modes indexed as m = 1, . . . , Ms, and assume that each sensor can observe a set of locations at each discrete time-instant (stage) with a mode selected per location. Let xi ∈ {0, 1, . . . , D} denote the state of location i, where xi = 0 if location i is unoccupied, and otherwise xi = k with k > 0 indicates that location i contains an object of type k. Let πi(0) ∈ ℜ^(D+1) be the initial, discrete probability distribution over the possible object types for the ith location for i = 1, . . . , N where D ≥ 2. Assume additionally that the random variables xi for i = 1, . . . , N are mutually independent. Therefore the joint probability π(t) = P(x1 = k1, x2 = k2, . . . , xN = kN |I(t)), representing the collection of all information known about the N locations, can be factored into the product of N marginal, per-location, conditional distributions as π(t) = ∏_{i=1}^{N} πi(t).

Let there be a series of T discrete decision stages with t = 0, . . . , T − 1 where sensors can make measurements, and assume all locations must be classified at or before stage T (terminal classification cost). Each sensor s has a limited set of locations that it can observe at each stage, denoted by Os(t) ⊆ {1, . . . , N}. At each stage t, each sensor can choose to employ one or more of its sensor modes to collect noisy measurements concerning the states xi of sensed locations where i ∈ Os(t). We assume there are
no data association problems concerning where a sensor observation comes from, so observables (measurements) can be assigned to the locations that generated them with certainty. A sensor action by sensor s at stage t is the set of pairs:

us(t) = {(is(t), ms(t)) | is(t) ∈ Os(t), ms(t) ∈ Ms}     (4.15)

where each pair consists of a location to observe, is(t), and a sensor mode (independent for each location) used to observe this location, ms(t), where the mode is restricted to the set of feasible modes given the resource levels for each sensor. We assume that no two sensors observe the same location at the same time in order to minimize the complexity of the associated action and observation spaces. Let ui,s(t) refer to the sensor action taken on location i with sensor s at stage t if any, or let ui,s(t) = ∅ otherwise.

Sensor measurements are modeled as belonging to a finite set ys,m ∈ {1, . . . , Ls}. The likelihood of the measured value is assumed to depend on the sensor s, sensor mode m, location i and on the true state at the location xi, but not on the states of other locations. Sensor measurements of non-visible locations are constrained to be uncorrelated with the states of those locations and therefore carry no information. Denote the likelihood of a sensor measurement of location i as P(ys,m(t)|xi, i, s, m). We assume that this likelihood given xi is time-invariant, and that the random measurements ys,m(t) are conditionally independent of other measurements yσ,n(τ) given the states xi, xj for all sensors s, σ and modes m, n provided i ≠ j or τ ≠ t.

A word is required concerning the causality between sensing actions and sensor observations. For our purposes we are considering that a sensor action at stage t generates a sensor measurement that is seen at stage t. Stage t + 1 is the first time at which the information observed at stage t can be acted upon.

Each sensor has a quantity Rs of resources available for measurements during each decision stage. Associated with the use of mode m by sensor s on location i at time t
is a resource cost rs(ui,s(t)) to use this mode, representing power or some other type of resource required to operate the sensor:

∑_{i∈Os(t)} rs(ui,s(t)) ≤ Rs   ∀ s ∈ [1 . . . S]; ∀ t ∈ [0 . . . T − 1]   (4.16)

This is a hard constraint for each sample-path: each and every possible realization of observations and actions. There are no sensor resource dynamics per se, because resource availability at a later stage does not depend on resource expenditures at earlier stages. There is, however, a time-varying demand for sensing resources, since the visibility constraints dictate the best and worst times for observing locations. These constraints model the limited information-processing bandwidth of sensors. The S sensors are treated as a set of finite-capacity resource pools wherein making sensor measurements consumes resources, but resources are renewed with every new time-step; aside from battery-powered devices, sensing capacity is a renewable resource.

Concerning the set Os(t) of visible locations at stage t, consider the following dynamics: location i has a deterministic visibility profile, wi,s(t), for each sensor s that is known a priori: wi,s : N+ → {0, 1}, where the indicator '1' indicates the location is visible. In this case, we assume that a sensor can look at a series of locations in any order so long as the ordering preserves visibility constraints, so Os(t) = {i | wi,s(t) = 1, i ∈ [1, . . . , N]} ∀ s, t. The sensor platforms themselves have no physical state representation. In this context, the action space is the set of all feasible sensor modes and feasible sensing locations (for all sensors) at a particular time.
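The visibility dynamics above amount to a simple table lookup. The following sketch (illustrative names only, not the dissertation's software) builds the visible sets Os(t) from the indicator profiles wi,s(t) and checks the hard per-stage budget of Eq. 4.16 for one sensor:

    import numpy as np

    def visible_sets(w):
        """w[i, s, t] is the known 0/1 visibility indicator w_{i,s}(t).
        Returns O[s][t]: the locations sensor s may observe at stage t."""
        N, S, T = w.shape
        return [[[i for i in range(N) if w[i, s, t] == 1] for t in range(T)]
                for s in range(S)]

    def stage_feasible(actions, r_cost, R_s, O_st):
        """actions: dict {location: mode} chosen by one sensor at one stage.
        r_cost[mode]: resource cost of using that mode on a location.
        Checks the per-stage budget sum_{i in O_s(t)} r_s(u_{i,s}(t)) <= R_s."""
        if any(i not in O_st for i in actions):      # visibility constraint
            return False
        return sum(r_cost[m] for m in actions.values()) <= R_s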
Let I(t) denote the sequence of past sensing actions and measurement outcomes up to and including stage t − 1:

I(t) = {(ui,s(τ), ys,m(τ)) | i ∈ Os(τ); s = 1, . . . , S; τ = 0, . . . , t − 1}

Define I(0) as the prior probability π(0) = p(x) = ∏_{i=1}^{N} p(xi). Under the assumption of conditional independence of measurements and independence of individual states at each location, the joint probability π(t) = P(x1 = k1, x2 = k2, . . . , xN = kN | I(t)) can be factored as the product of belief-states (marginal conditional probabilities) for each location. Denote the belief-state at location i as πi(t) = p(xi|I(t)). When a sensor measurement is taken, the belief-state is updated according to Bayes' Rule. A measurement of location i with the sensor-mode combination ui,s(t) = (i, m) at stage t that generates observable ys,m(t) updates the belief-vector as:

πi(t + 1) = diag{P(ys,m(t)|xi = j, i, s, m)} πi(t) / ( 1^T diag{P(ys,m(t)|xi = j, i, s, m)} πi(t) )   (4.17)

where 1 is the (D + 1)-dimensional vector of all ones. Eq. 4.17 captures the relevant information dynamics that SM controls through the choice of measurement actions. The objective for this formulation is to classify, with minimum cost as measured by FAs and MDs, the state of each location at the end of T stages:

J = min_{γ∈Γ} E^γ [ ∑_{i=1}^{N} c(xi, vi) ]   (4.18)

subject to Eq. 4.16. After replacing the hard resource constraint in Eq. 4.16 with an expected-resource-use constraint for each of the S sensors, we have the constraints:

∑_{i∈Os(t)} E[rs(ui,s(t))] ≤ Rs   ∀ s ∈ [1 . . . S]; ∀ t ∈ [0 . . . T − 1]   (4.19)

We can dualize the resource constraints and create an augmented objective function (Lagrangian) of the form:

Jλ = min_{γ∈Γ} E^γ [ ∑_{i=1}^{N} c(xi, vi) − ∑_{s=1}^{S} ∑_{t=0}^{T−1} λs(t) ( Rs − ∑_{i∈Os(t)} rs(ui,s(t)) ) ]   (4.20)
This problem is a lower bound on the original problem of Eq. 4.18 with the sample-path constraints of Eq. 4.16, because every strategy that satisfies the original sample-path-constrained problem is feasible for the relaxed problem. Using Lemma 4.1.1 again, the cost function can be decoupled into the sum of N local cost functions as follows:

Jλ = min_{γ∈Γ} E^γ [ ∑_{i=1}^{N} c(xi, vi) + ∑_{s=1}^{S} ∑_{t=0}^{T−1} ∑_{i∈Os(t)} λs(t) rs(ui,s(t)) ] − ∑_{s=1}^{S} ∑_{t=0}^{T−1} λs(t) Rs

Jλ = min_{γ∈Γ} E^γ [ ∑_{i=1}^{N} ( c(xi, vi) + ∑_{t=0}^{T−1} ∑_{s=1}^{S} λs(t) rs(ui,s(t)) I(i ∈ Os(t)) ) ] − ∑_{t=0}^{T−1} ∑_{s=1}^{S} λs(t) Rs

where I(·) is an indicator function (its meaning is context-sensitive).

Jλ = ∑_{i=1}^{N} min_{γi∈ΓL} E^{γi} [ c(xi, vi) + ∑_{t=0}^{T−1} ∑_{s=1}^{S} λs(t) rs(ui,s(t)) I(i ∈ Os(t)) ] − ∑_{t=0}^{T−1} ∑_{s=1}^{S} λs(t) Rs   (4.21)

From here on we can see that this problem variation with time-varying but known object visibility is a straightforward extension of our existing formulation discussed in Ch. 3. A lower bound follows that is analogous to that of Eq. 4.12 - Eq. 4.14. The only difference in the time-varying visibility case is that there is a time-varying set of feasible actions for each subproblem, such that for stages when a subproblem is not visible the set of available actions (for the subproblem in question) is constrained to the empty set (waiting). We can again resort to the same Column Generation techniques that were developed in the previous section as an extension to the algorithm described in Section 2.2.1.
4.3 Summary

In conclusion, we have developed two new lower bounds that apply to SM problems that involve time-varying (HMM) states or known, time-varying location visibility (e.g. for application to satellite sensing operations). These lower bounds are useful for planning with RH control techniques. We have described how the Column Generation algorithm of Section 2.2.1 can be extended to these more general problem formulations, and we have conducted simulations along these lines for the case where location states use Markov Birth Process dynamics (as mentioned in Fig. 2·7, Fig. 2·8 and Fig. 2·6). We expect that the same type of RH control algorithms described in Ch. 3 using Column Generation with mixtures of pure strategies over sensors and over time (i.e. time-varying Lagrange multipliers) will give near-optimal performance in a computationally tractable amount of time.
Chapter 5

Adaptive Sensing with Continuous Action and Measurement Spaces

In this chapter, we present an alternative approach for the inhomogeneous adaptive search problems studied in [Bashan et al., 2007, Bashan et al., 2008] based on the constraint relaxation approach developed in [Castañón, 1997, Castañón, 2005b, Hitchings and Castañón, 2010]. Our approach decomposes the global two-stage SM problem into per-location two-stage SM problems, coupled through a master problem for partitioning sensing resources across subproblems, which is solved using Lagrangian Relaxation. The resulting algorithm obtains solutions of comparable quality to the grid search approaches proposed in [Bashan et al., 2007, Bashan et al., 2008] with roughly two orders of magnitude less computation.

Whereas in Ch. 3 the computation went towards considering a relatively narrow set of actions with a handful of possible observations over a long (deep) horizon, this chapter considers a continuum of actions and observations with a shallower horizon. In this chapter, the states of locations are binary-valued: X = {'empty', 'occupied'}. Therefore this chapter basically implements the 'search' mode of Ch. 3 for the shortest horizon considered in that chapter, but with a much higher-resolution sensor model.

The remainder of this chapter proceeds as follows: Section 5.1 describes the two-stage adaptive SM problem of [Bashan et al., 2007, Bashan et al., 2008]. Section 5.2 describes our solution to obtain adaptive sensor allocation strategies. Section 5.3 discusses an
alternative approach based on DP techniques. In Section 5.4, we provide simulation results that compare our algorithms with the algorithms in [Bashan et al., 2008].

5.1 Problem Formulation

Consider an area that contains Q locations (or 'cells') to be measured with R total units of resources (e.g. energy) over T = 2 stages indexed by t, starting with t = 0. Let the cells be indexed by k ∈ [1, . . . , Q]. Define the indicator function Ik = 1 if the kth cell contains an object, and let Ik = 0 otherwise. The variables Ik are assumed to be random and independent, with prior probability πk0 = Pr(Ik = 1).

The decision variables at stages t = 0, 1, are denoted by xkt, k = 1, . . . , Q, corresponding to the sensor resource allocated to each cell k at stage t. As in [Bashan et al., 2007, Bashan et al., 2008], allocating resources to a cell corresponds to allocating radar energy to that cell, which improves the signal-to-noise ratio (SNR) of a measurement there. A measurement generated for each cell k at stages 0, 1, is described by:

Ykt = √xkt Ik + vk(t)   (5.1)

where vk(t) are Gaussian, zero-mean, unit-variance random variables that are mutually independent across k and t and independent of Ii for all i. Thus, xkt represents the energy allocated to cell k at stage t and is also the signal-to-noise ratio for the measurement. See Fig. 5·1 for a graphical depiction of this sensor model.
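For concreteness, the measurement model of Eq. 5.1 is straightforward to simulate; a minimal sketch, assuming a NumPy-style interface and made-up variable names:

    import numpy as np

    def simulate_stage(x_t, I, rng=None):
        """x_t[k]: energy allocated to cell k this stage (also its SNR).
        I[k]: true occupancy indicator of cell k.
        Returns Y_t[k] = sqrt(x_t[k]) * I[k] + v_k with v_k ~ N(0, 1)."""
        rng = np.random.default_rng(0) if rng is None else rng
        return np.sqrt(x_t) * I + rng.standard_normal(len(x_t))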
Figure 5·1: Depiction of measurement likelihoods for empty and non-empty cells as a function of xk0 (curves shown for xk0 = 0.81, 2.83, 5.66). √xk0 gives the mean of the density p(Yk0|Ik = 1). If the cell is empty the observation always has mean 0 (black curve).

Let Yt = [Y1t, Y2t, . . . , YQt] denote the measurements collected across all locations at stage t. For adaptive sensing, we allow the allocations xk1 to be a function of the observations Y0. Thus, an adaptive sensor allocation is a strategy γ = {(xk0, xk1(Y0)), k = 1, . . . , Q}. The resource (energy) constraint requires that feasible adaptive sensor allocations satisfy:

∑_{k=1}^{Q} (xk0 + xk1(Y0)) ≤ R   for all Y0   (5.2)

Note that the resulting optimization problem has a continuum of constraints, one for each sample path of the observations Y0. We initially use the following cost function, from [Bashan et al., 2007, Bashan et al., 2008], to develop an adaptive sensing strategy:

Jγ = E^γ [ ∑_{k=1}^{Q} Ik / (xk0 + xk1(Y0)) ]   (5.3)

This cost function rewards allocating more sensing resources to cells that are occupied. There are many other variations that can be considered, but this simple form is sufficient to illustrate our approach.

The above problem can be solved in principle using SDP, but the cost is not easily separable across stages. Nevertheless, one can still use nested expectation principles to
Figure 5·2: Waterfall plot of the joint probability p(Yk0|Ik; xk0) for πk0 = 0.50 and xk0 ∈ [0 . . . 20]. This figure shows the increased discrimination ability that results from using higher-energy measurements (separation of the peaks).

approach this problem, as outlined in [Bashan et al., 2008]. Define the belief-state:

πk,t+1 = P(Ik = 1|Yt) = P(Ik = 1|Ykt),   k = 1, . . . , Q   (5.4)

and resource dynamics:

R1 = R − ∑_{k=1}^{Q} xk0

where the last equality in Eq. 5.4 follows from the independence assumptions made on the cell indicator functions and the measurement noise. Fig. 5·2 is a 3D plot displaying how increasing energy expenditures improves the discriminatory power of sensor measurements (improves SNR), and Fig. 5·3 demonstrates how this increased discriminatory power affects the resulting a posteriori probability for a cell, πk1.
Figure 5·3: Graphic showing the posterior probability πk1 as a function of the initial action xk0 and the initial measurement value Yk0. This surface plot is for λ = 0.01 and πk0 = 0.20. (The boundary between the high (red) and low (blue) portions of this surface is not straight but curves towards −y with +x.)

Let πt = [π1t, . . . , πQt], x0 = [x10, . . . , xQ0], x1 = [x11, . . . , xQ1]. Then:

min_γ Jγ = min_γ E^γ [ ∑_{k=1}^{Q} Ik / (xk0 + xk1(Y0)) ]
        = min_{x0} E [ min_{x1(Y0)} E [ ∑_{k=1}^{Q} Ik / (xk0 + xk1(Y0)) | Y0 ] ]
        = min_{x0} E [ min_{x1} ∑_{k=1}^{Q} πk1 / (xk0 + xk1) ]   (5.5)

For each value of x0 and measurements Y0, the inner minimization is a deterministic resource allocation problem with a convex, separable objective and a single constraint ∑_{k=1}^{Q} xk1 = R − ∑_{k=1}^{Q} xk0, which is straightforward to solve.
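For a given (x0, Y0), this inner minimization can be solved with a standard KKT (water-filling) argument: setting the derivative of πk1/(xk0 + xk1) + μ xk1 to zero gives xk1 = max(0, √(πk1/μ) − xk0), with the multiplier μ chosen so that the remaining budget is exactly consumed. The sketch below illustrates the idea with a one-dimensional bisection on μ; it is an illustration under these assumptions, not the implementation used in this dissertation.

    import numpy as np

    def inner_allocation(pi1, x0, R1):
        """Minimize sum_k pi1[k]/(x0[k] + x1[k])  s.t.  sum_k x1[k] = R1, x1 >= 0.
        KKT condition: x0[k] + x1[k] = sqrt(pi1[k]/mu) wherever x1[k] > 0."""
        def spend(mu):
            return np.maximum(0.0, np.sqrt(pi1 / mu) - x0).sum()

        lo, hi = 1e-12, 1e12              # bracket for the multiplier mu
        for _ in range(200):              # spend(mu) is decreasing in mu
            mu = np.sqrt(lo * hi)
            if spend(mu) > R1:
                lo = mu
            else:
                hi = mu
        return np.maximum(0.0, np.sqrt(pi1 / mu) - x0)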
The algorithm of [Bashan et al., 2008] then solves the overall problem by enumerating possible values of x0, simulating values of Y0 for each x0 and subsequently solving the inner minimization problem for each (x0, Y0) combination to find the best set of adaptive sensor allocations.

5.2 Relaxed Solution

In order to avoid enumeration of all feasible resource allocations at stage 0, we adopt a constraint relaxation approach that expands the space of feasible sensor allocations, as in Section 2.2 and [Castañón, 1997, Castañón, 2005b]. Specifically, we approximate the constraints implied by Eq. 5.2 with a single resource constraint that constrains the total expected energy use:

E [ ∑_{k=1}^{Q} (xk0 + xk1(Y0)) ] ≤ R,   xkt ≥ 0   (5.6)

This expands the space of admissible sensor allocations, so the solution of the SM problem with these constraints yields a lower bound to the original problem. We introduce a Lagrange multiplier λ ≥ 0 to integrate this constraint into an augmented objective function:

Jγ(λ) = ∑_{k=1}^{Q} E [ Ik / (xk0 + xk1(Y0)) ] + λ ( E [ ∑_{k=1}^{Q} (xk0 + xk1(Y0)) ] − R )   (5.7)

Given λ, Eq. 5.7 is additive over the cells, so for the remainder of this chapter we concentrate on the cost of optimally classifying just one cell. This separability establishes the following result:

Theorem 5.2.1. An optimal solution of Eq. 5.7 subject to the constraints of Eq. 5.6 is achieved by an adaptive sensing strategy where xk1(Y0) ≡ xk1(Yk0) ≡ xk1(πk1).

The proof of this follows along the lines of the results in [Castañón, 2005b, Castanon and Wohletz, 2009]. This result restricts the adaptive sensing allocations for each cell k to depend only on the information collected on that cell, summarized by πk1, and leads
to a decomposition of the optimization problem over each cell k for each value of λ ≥ 0. We analyze this single-cell problem next. Consider the expected cost in cell k, ignoring the term λR that is independent of xk0, xk1, which is:

min_{xk0,xk1} Jλ_k = min_{xk0} E [ min_{xk1} E [ Ik / (xk0 + xk1(Y0)) + λ(xk0 + xk1(Y0)) | Y(0) ] ]   (5.8)

After computing expectations, the inner minimization becomes:

Jλ*_{k,1}(xk0, πk1) = min_{xk1} [ πk1 / (xk0 + xk1) + λ(xk0 + xk1) ]   (5.9)

The optimal allocation xk1 can be found through differentiation as:

xk1 = √(πk1/λ) − xk0   if xk0² < πk1/λ
xk1 = 0                otherwise          (5.10)

The optimal adaptive strategy at stage 1 has two regions: a region where it is not worth allocating additional resources in stage 1 to this cell because πk1 is too small, and a region where the cell will receive resources in stage 1. Associated with these regions is an optimal cost Jλ*_{k,1}(xk0, πk1) given by:

Jλ*_{k,1}(xk0, πk1) = 2√(λ πk1)           if xk0² < πk1/λ
Jλ*_{k,1}(xk0, πk1) = πk1/xk0 + λ xk0     otherwise          (5.11)

So in general, the overall inner minimization at time 1 has the form:

Jλ*_{k,1}(xk0, πk1) = ( πk1/xk0 + λ xk0 ) I( xk0² ≥ πk1/λ ) + 2√(λ πk1) I( xk0² < πk1/λ )   (5.12)
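Eqs. 5.10 and 5.11 translate directly into a per-cell decision rule. A minimal sketch (hypothetical names; it assumes xk0 > 0 whenever no follow-up measurement is taken):

    import numpy as np

    def stage1_action_and_cost(pi1, x0, lam):
        """Closed-form stage-1 allocation and cost for one cell (Eqs. 5.10-5.11).
        pi1: posterior P(I_k=1 | Y_k0); x0: stage-0 energy; lam: Lagrange multiplier."""
        if x0 ** 2 < pi1 / lam:              # worth taking a second measurement
            x1 = np.sqrt(pi1 / lam) - x0
            cost = 2.0 * np.sqrt(lam * pi1)
        else:                                # stop after the first measurement
            x1 = 0.0
            cost = pi1 / x0 + lam * x0
        return x1, cost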
Recall that πk1 = p(Ik = 1|Yk0; xk0), so it is computed from Bayes' rule as:

πk1 = p(Yk0|Ik = 1; xk0) πk0 / [ p(Yk0|Ik = 1; xk0) πk0 + p(Yk0|Ik = 0; xk0)(1 − πk0) ]   (5.13)

We now analyze the boundary defined by the condition in Eq. 5.11 that determines whether one measurement should be made or two, and thus which of the two mutually-exclusive terms in the cost function will be active. Starting from Eq. 5.11 and using Bayes' rule, we define the set Y(xk0, λ) as all yk0 (for a given λ) such that:

xk0² < (1/λ) · N(yk0; √xk0, 1) πk0 / [ N(yk0; √xk0, 1) πk0 + N(yk0; 0, 1)(1 − πk0) ]

Note that Y(xk0, λ) is a monotone increasing set as πk0 increases, and is monotone decreasing in λ. After some simplifications, this set is equivalent to all Yk0 that satisfy the following inequality:

2 log xk0 < − log λ + √xk0 Yk0 − xk0/2 − log[ exp(−(1/2)(xk0 − 2√xk0 Yk0)) + (1 − πk0)/πk0 ]   (5.14)

When this inequality is true, then Yk0 ∈ Y(xk0, λ) and xk1 > 0; otherwise xk1 = 0. Fig. 5·4 illustrates this boundary. In Fig. 5·5, the boundary that defines Y(xk0, λ) as a function of xk0 and Yk0 is shown, for an arbitrary value of λ, for illustrative purposes. This surface is basically the log of the posterior probability πk1 shown in Fig. 5·3 with a vertical offset given by −2 log(xk0). Fig. 5·6 gives 3 cross-sections through the boundary/surface of Fig. 5·5. Fig. 5·7 is a two-factor exploration of the parameter space (λ, πk0) that we used in order to investigate the structure of this boundary. Examining the monotonicity of the boundary is important in order to determine that our algorithm is well behaved, i.e. that it will always converge on the optimal choice of Lagrange multiplier that decouples the individual subproblems.
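The posterior of Eq. 5.13 and the membership test for Y(xk0, λ) can be evaluated directly under the unit-variance Gaussian model of Eq. 5.1; a sketch (names are illustrative):

    import numpy as np
    from scipy.stats import norm

    def posterior(y0, x0, pi0):
        """Eq. 5.13: P(I_k = 1 | Y_k0 = y0) when Y_k0 = sqrt(x0)*I_k + N(0, 1)."""
        num = norm.pdf(y0, loc=np.sqrt(x0)) * pi0
        den = num + norm.pdf(y0, loc=0.0) * (1.0 - pi0)
        return num / den

    def take_second_measurement(y0, x0, pi0, lam):
        """Membership in Y(x_k0, lambda): true when x_k0**2 < pi_k1 / lambda,
        i.e. when a follow-up allocation x_k1 > 0 is optimal."""
        return x0 ** 2 < posterior(y0, x0, pi0) / lam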
Figure 5·4: Cost function boundary (see Eq. 5.14) with λ = 0.011 and πk0 = 0.18. In the lighter region two measurements are made, in the darker region just one. (Note positive y is downwards.)

Figure 5·5: The optimal boundary for taking one action or two as a function of (xk0, Yk0) (for the [Bashan et al., 2008] cost function) for λ = 0.01 and πk0 = 0.20. The curves in Fig. 5·6 represent cross-sections through this surface for the 3 x-values referred to in that figure.
Figure 5·6: This figure gives another depiction of the optimal boundary between taking one measurement action or two for the [Bashan et al., 2008] cost function (cross-sections at xk0 = 0.81, 2.83, 5.66, with λ = 0.01 and πk0 = 0.20). For all Y(xk0, λ) ≥ 0 two measurements are made (and the highest curve is for the smallest xk0; see Fig. 5·5 for the 3D surface from which these cross-sections were taken).

We now develop an expression for the outer expectation in Eq. 5.8, which defines the problem of optimizing xk0, as follows: let p(yk0; xk0) = p(Yk0|Ik = 1; xk0) πk0 + p(Yk0|Ik = 0; xk0)(1 − πk0). Then using Eq. 5.13 yields:

E[Jλ*_{k,1}(xk0, πk1)] = ∫_{y∈Y(xk0,λ)} 2√(λ πk1) p(yk0; xk0) dyk0 + ∫_{y∉Y(xk0,λ)} [ πk1/xk0 + λ xk0 ] p(yk0; xk0) dyk0
                      = ∫_{y∈Y(xk0,λ)} 2√( λ πk0 N(yk0; √xk0, 1) p(yk0; xk0) ) dyk0 + ∫_{y∉Y(xk0,λ)} [ (πk0/xk0) N(yk0; √xk0, 1) + λ xk0 p(yk0; xk0) ] dyk0

To minimize E[Jλ*_{k,1}(xk0, πk1)] with respect to xk0, we have to evaluate the integrals above. Note that the regions of integration also depend on xk0. Although it is possible
Figure 5·7: Two-factor exploration to determine how the optimal boundary between taking one measurement or two measurements varies for a cell with the parameters (p, λ), where p = πk0 (for the [Bashan et al., 2008] problem cost function). Two measurements are taken in the darker region, one measurement in the lighter region.
to take derivatives and try to solve for possible minima, it is simpler to evaluate the integrals by numerical quadrature and obtain the optimal value xk0 by numerical search in one dimension.

The above procedure computes the optimal strategy for the relaxed problem of Eq. 5.7 for fixed λ. To obtain a solution to the original problem with the relaxed constraints of Eq. 5.6, we perform an optimization over λ: a one-dimensional concave maximization problem where the direction of ascent is readily identified. A subgradient direction is given by:

∂Jγ(λ) = E [ ∑_{k=1}^{Q} (xk0 + xk1(Y0)) ] − R
       = ∑_{k=1}^{Q} ∫_{y∈Y(xk0,λ)} √(πk1/λ) p(yk0; xk0) dyk0 + ∑_{k=1}^{Q} ∫_{y∉Y(xk0,λ)} xk0 p(yk0; xk0) dyk0 − R

Note that the relevant expectations have already been evaluated by quadrature in the search for the optimal xk0 for each λ, which makes searching for the optimal λ straightforward.

The algorithm above obtains optimal adaptive sensor allocations for optimizing Eq. 5.3 subject to the relaxed constraints in Eq. 5.6. To obtain sensor allocations that satisfy the original constraints of Eq. 5.2, we use the sensor allocations {xk0, k = 1, . . . , Q} determined by our procedure above, collect the vector of observations Y0 across all cells, and then replan the optimal allocation of the remaining resources, enforcing the constraint of Eq. 5.2 for the specific observations Y0. This stage-1 optimization problem is a straightforward deterministic separable convex minimization problem with a single additive constraint, and can be solved analytically in finite time by indexing the cells, as described in [Bashan et al., 2008].
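Putting the pieces of this section together, the outer loop can be sketched as follows: for a trial λ, the expected cost and the expected spend of each cell are evaluated by quadrature over a grid of yk0 values, the best xk0 is found with a one-dimensional line search, and λ is then adjusted using the subgradient above (here via bisection on the expected total spend). The grid sizes, bisection bracket, and function names below are illustrative assumptions, not a reproduction of our implementation.

    import numpy as np
    from scipy.stats import norm

    def expected_cost_and_spend(x0, pi0, lam, y_grid):
        """Quadrature over y_k0 for one cell: returns (E[J], E[x_k0 + x_k1])."""
        p1 = norm.pdf(y_grid, loc=np.sqrt(x0))
        p0 = norm.pdf(y_grid, loc=0.0)
        p_y = p1 * pi0 + p0 * (1 - pi0)                  # marginal density of y_k0
        pi1 = p1 * pi0 / p_y
        two = x0 ** 2 < pi1 / lam                        # region Y(x_k0, lambda)
        cost = np.where(two, 2 * np.sqrt(lam * pi1), pi1 / x0 + lam * x0)
        spend = np.where(two, np.sqrt(pi1 / lam), x0)
        dy = y_grid[1] - y_grid[0]
        return (cost * p_y).sum() * dy, (spend * p_y).sum() * dy

    def plan(pi0_vec, R, lam_lo=1e-4, lam_hi=10.0, n_iter=30):
        """Bisection on lambda so the total expected spend matches the budget R."""
        y_grid = np.linspace(-8.0, 15.0, 1000)
        x_grid = np.linspace(0.1, 20.0, 100)
        for _ in range(n_iter):
            lam = np.sqrt(lam_lo * lam_hi)
            x0, total = [], 0.0
            for pi0 in pi0_vec:                          # best x_k0 per cell by line search
                results = [expected_cost_and_spend(x, pi0, lam, y_grid) for x in x_grid]
                best = int(np.argmin([c for c, _ in results]))
                x0.append(x_grid[best])
                total += results[best][1]
            if total > R:
                lam_lo = lam                             # spending too much: raise the price
            else:
                lam_hi = lam
        return np.array(x0), lam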
5.3 Bayesian Objective Formulation

One of the factors that makes the adaptive SM problem in [Bashan et al., 2008] complex is that the objective function of Eq. 5.3 does not depend on the observed measurements at stage 1, Y1. As argued in [Bashan et al., 2008], this objective is related to lower bounds on performance, such as Cramér-Rao bounds on the estimation of Ik or Chernoff bounds on correct classification, particularly for open-loop allocations. The resulting cost is not separable across stages, and is a hybrid of bounds and actual expected performance.

A direct approach would have been to define a cost at stage 2 that depends on the measurements of both stages 0 and 1, along with a decision stage that generates an estimate or a classification for each cell, as in [Castañón, 2005b], and to use DP techniques for partially observed Markov decision processes (POMDPs), suitably extended to incorporate continuous-valued action and measurement spaces. We assume that for each cell k at stage 2 there will be a classification decision uk2 ∈ {0, 1}, which depends on the observed measurements Y0, Y1. For any cell k where we have collected Yk0, Yk1, denote the conditional probability πk2 = P(Ik = 1|Yk0, Yk1; xk0, xk1). Let MD denote a constant representing the relative cost of a missed detection at the end of 2 stages, where the false alarm cost (FA) is held constant at 1. The Bayes' cost we seek to optimize is:

J^{Bayes} = E [ ∑_{k=1}^{Q} ( MD·Ik(1 − uk2) + (1 − Ik)uk2 ) ]

The surfaces of Fig. 5·8 show how the MD and FA costs vary with the values of xk1 and Yk1 after making the first measurement. These figures are cost-samples for the cost that results after each outcome (xk1, Yk1) at the last stage. The FA classification cost is independent of the amount of energy used in the second measurement, but the MD classification cost is not.
Figure 5·8: Plot of cost function samples associated with false alarms, missed detections and the optimal choice between false alarms and missed detections (for the Bayes' cost function).
Incorporating the relaxed constraints of Eq. 5.6, we get an augmented cost that can be decomposed over cells, along the lines of the formulation in the previous section, to obtain the cell-wise optimization objective:

J^{Bayes}_k(λ) = E[ MD·Ik(1 − uk2) + (1 − Ik)uk2 + λ(xk0 + xk1) ]

where xk1 depends on Yk0, and uk2 depends on Yk0, Yk1. Using DP, the sufficient statistic for this problem is the information state πkt for each stage t. The cost-to-go at stage 2 is:

J*_{k,2}(πk2) = min{ MD·πk2, (1 − πk2) }

The DP recursion is now:

Jλ*_{k,1}(πk1) = min_{xk1} E [ min{ MD·p(Yk1|1; xk1) πk1 / p(Yk1; xk1), 1 − p(Yk1|1; xk1) πk1 / p(Yk1; xk1) } + λ xk1 ]   (5.15)

Using this DP recursion yields the optimization problem for xk0:

min_{xk0} E [ Jλ*_{k,1}(πk1) ] + λ xk0

Fig. 5·9 shows a set of plots for the Bayes' cost function when t = 1 for the probabilities πk1 = 0.0991 and πk1 = 0.3493 with λ = 0.0001. The first row in the figure shows cost-samples for the unaugmented cost function, the middle row for Eq. 5.15 and the final row shows the corresponding joint probabilities for each prior. These figures demonstrate that the higher the value of πk1, the larger the cost associated with a measurement Yk1 that is near 0 (the measurement is ambiguous, so the chance of making a mistake and paying a classification cost is relatively large). Fig. 5·10 describes the boundary between declaring a cell to be empty or occupied for the Bayes' cost function, with γ(xk1) defined as the observation value Yk1 that makes the FA and MD costs equal.
Figure 5·9: This figure shows cost-to-go function samples as a function of the second sensing action xk1 and the second measurement Yk1 for the Bayes' cost function. These plots use 1000 samples for Yk1 and 100 for xk1.
Figure 5·10: Threshold function for declaring a cell empty (risk of MD) or occupied (risk of FA).

From an algorithmic perspective, we compute Jλ*_{k,1}(πk1) for a discrete set of points on the unit interval. For each point πk1, a discrete set of observations and a discretized set of allocations xk1 are evaluated to compute the resulting conditional probabilities πk2, and a cost-to-go value J*_{k,2}(πk2) is obtained by interpolation. Summing over measurements for each xk1 and multiplying by a scale factor yields expectations (quadrature), and minimizing over xk1 leads to the cost-to-go value at πk1. This procedure is closely related to the PBVI algorithm; see Appendix A.2. A similar procedure takes place at stage 0, except that only one belief-point for πk0 needs to be considered.

The above algorithm depends on the value of λ. We find the optimal λ with a search identical to the one in the previous section. The approach described above has several major advantages: first, the cost is additive over stages, and thus allows direct application of DP techniques. Second, the separability and shift-invariance of the costs allows for computation of a single cost-to-go function Jλ*_{k,1}(πk1) for all cells k, which is a significant savings in computation.
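A sketch of this backup step: the stage-1 cost-to-go is tabulated on a grid of belief points, with expectations taken by quadrature over a measurement grid. Because the stage-2 cost has the closed form min(MD·π, 1 − π), the sketch evaluates it directly rather than by interpolation; the grids and names are illustrative assumptions, not the MATLAB implementation used in this dissertation.

    import numpy as np
    from scipy.stats import norm

    def stage1_cost_to_go(pi_grid, x_grid, y_grid, lam, MD=1.0):
        """Tabulates J^{lam*}_{k,1}(pi) = min_x1 E[ min(MD*pi2, 1-pi2) + lam*x1 ]
        on a grid of belief points, where pi2 is the Bayes update of pi after
        observing Y_k1 = sqrt(x1)*I_k + N(0, 1)."""
        dy = y_grid[1] - y_grid[0]
        J = np.empty_like(pi_grid)
        for n, pi in enumerate(pi_grid):
            best = np.inf
            for x1 in x_grid:
                p1 = norm.pdf(y_grid, loc=np.sqrt(x1))
                p0 = norm.pdf(y_grid, loc=0.0)
                p_y = p1 * pi + p0 * (1 - pi)
                pi2 = p1 * pi / p_y                       # posterior after the stage-1 measurement
                terminal = np.minimum(MD * pi2, 1 - pi2)  # J*_{k,2}(pi2)
                val = (terminal * p_y).sum() * dy + lam * x1
                best = min(best, val)
            J[n] = best
        return J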
5.4 Experiments

We now consider some simulation results. Simulations were done using MATLAB on a 2.2 GHz, single-core, Intel P4, Linux machine. We ran two separate simulation configurations. The first was an independent experiment and the second was meant to be compared against the algorithm of [Bashan et al., 2007, Bashan et al., 2008]. In the first configuration, we used 1000 points to discretize the observation space for quadrature, 100 discrete points to search for optimal sensor allocations xk0 with a line search, and 500 units of energy, which gives an SNR of 10 log10(R/Q) = 6.99 dB. The values reported in the experiments are averages over 100 simulation runs using Q = 100 cells. A set of prior probabilities was created for the 100 simulations using 100 independent samples from a gamma distribution with shape 2 and scale 3. The values were scaled and thresholded to fall in the range [0.05 . . . 0.80]. The net result was a vector of prior probabilities with most of its values around 0.10–0.20 and with a few higher-probability elements.

We first focus on comparing the adaptive sensor allocation algorithm using the relaxation approach of Section 5.2. Fig. 5·11 displays the initial resource allocation as a function of the value of the prior probability of each cell, πk0. The amount of resource initially allocated to measuring a cell is a monotonically increasing function of the chance that an object is there, as the cost function rewards spending resources on cells that contain objects. Fig. 5·12 shows a similar behavior for the total expected resource allocations per cell (with the expected value of the follow-up resource allocations xk1 making up the difference between Fig. 5·11 and Fig. 5·12). The striations seen in Fig. 5·11 are artifacts of the resource allocation quantizations at stage 0, which had a granularity of around 0.2 units of energy.
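The prior-generation step described above can be reproduced along the following lines (a sketch; the exact scaling and thresholding used to reach the [0.05 . . . 0.80] range is not specified, so a max-normalization followed by clipping is assumed):

    import numpy as np

    def make_priors(Q=100, seed=0):
        """Draws Q priors from a Gamma(shape=2, scale=3) distribution and maps
        them into [0.05, 0.80], as in the first experimental configuration."""
        rng = np.random.default_rng(seed)
        raw = rng.gamma(shape=2.0, scale=3.0, size=Q)
        scaled = raw / raw.max() * 0.80      # assumed scaling
        return np.clip(scaled, 0.05, 0.80)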
There are more points on these graphs for small values of πk0 because a prior was used with a relatively low probability of object occurrence. Fig. 5·13 is very similar to Fig. 5·12 because the classification cost is completely determined by the total amount of resource spent and the prior probability of an object being present.

Figure 5·11: The 0th-stage resource allocation as a function of prior probability. The striations are an artifact of the discretization of resources when looking for optimal xk0.

In terms of computation time, determining the optimal adaptive sensor allocation (for the case of non-uniform priors) required around two minutes for our MATLAB implementation. Subsequent evaluation of the performance using Monte Carlo runs required around 10 seconds per simulation. To summarize the performance statistics, the average simulation cost over 100 simulations was 2.85 units, and the sample variance of the simulation cost was 0.40 units.

In order to compare our results with the results of [Bashan et al., 2008] for the second simulation configuration, we obtained a MATLAB version of their algorithm with a
Figure 5·12: Total resource allocation to a cell as a function of prior probability. The point-wise sums of the 0th-stage and 1st-stage resource expenditures are displayed here.

specific problem implementation that contained a grid of Q = 1024 cells and used an action space discretized to 100 energy levels. Their algorithm evaluated expectations using 500 random samples for Y0. We used 1000 energy levels and an observation space discretized to 100 observations. The algorithms in [Bashan et al., 2008] assume a uniform prior probability πk0 = 0.01, which reduces the need for a Q-dimensional enumeration to a one-dimensional enumeration because of symmetry, so xk0 is constant across all cells k. To get an overall SNR of 10, the resource (energy) level R was set to 10240 so that, on average, there are 10 units of resources per cell. We used 100 Monte Carlo simulations to evaluate the performance of the adaptive sensor allocations.

With this second configuration involving an order of magnitude more cells, our MATLAB algorithm created an adaptive sensor allocation plan in 9.5 seconds by exploiting the symmetry across cells. Evaluation of performance with the 100 Monte Carlo
Figure 5·13: Cost associated with a cell as a function of prior probability. For the optimal resource allocations, there is a one-to-one correspondence between the cost of a cell and the resource utilized to sense a cell.

simulations required an additional 17.7 seconds to do the 100 simulations. We ran the same experiments using the MATLAB code implementing the approach described in [Bashan et al., 2008], which required 100.3 sec to compute adaptive sensor allocations, and 2.7 sec of simulation time to evaluate performance (because their algorithm does not require in-line replanning to ensure feasibility). The results are summarized in Table 5.1. The results show that our algorithm achieves performance close to that of the optimal algorithm described in [Bashan et al., 2008] with significantly lower computational requirements.

The above comparison does not highlight the fact that our algorithm is scaleable for problems with non-uniform prior information. In essence, for our algorithm non-uniform priors would increase computation time by a factor of Q, the number of distinct cells that require evaluation using our relaxation approach. In contrast, computation times for the algorithm of [Bashan et al., 2008] would require enumeration of 1024^Q
locations versus 1024 locations, an exponential increase in time, making consideration of non-uniform priors an infeasible problem.

              Average Cost   Standard Deviation   Solution Time (s)   Simulation Time (s)
  Relaxation  0.2762         0.14                 9.5                 17.7
  Exact       0.2666         0.15                 100.3               2.7

Table 5.1: Performance comparison averaged over 100 Monte Carlo simulations. Relaxation is the algorithm proposed in this chapter, while Exact is the algorithm of [Bashan et al., 2008].

Figure 5·14: Cost-to-go from πk1.

As a final set of experiments, we implemented the algorithm of Section 5.3 using the first simulation configuration with 100 cells and MD = 1. We used a grid of 1000 belief-points to represent the cost-to-go function, with cubic spline interpolation for values in between points. The optimal cost-to-go at stage 1 in the Bayes' cost function is shown in Fig. 5·14 and the corresponding optimal energy allocations are in Fig. 5·15. The results show the expected cutoff where no action is taken once enough certainty exists in πk1.
Figure 5·15: Optimal stage 1 energy allocations.

Fig. 5·16 describes the initial resource allocations. These allocations are symmetrical w.r.t. πk0; since most of the priors are small, the figure has more points for small values. In terms of computation time, this algorithm required about 6 minutes of MATLAB time to obtain the sensor plan, and around 6 seconds to run each simulation. However, this number will not increase substantially for problems with more cells, because the cost-to-go functions are reused across cells. Note also that the computations are trivially parallelizable, as the cost-to-go can be computed in parallel for each information state.

In this chapter, we developed alternative algorithms that directly exploit Lagrangian Relaxation techniques by exploring a constraint relaxation approach that replaces the original resource constraints with an approximation. The Lagrangian Relaxation techniques are faster and scale to more complex problems of the type addressed in search theory, as shown in our analysis and simulation results. The constraint relaxation techniques, coupled with Lagrangian Relaxation, enable hierarchical decomposition of
Figure 5·16: Stage 0 energy allocation versus prior probability.

coupled problems into independent subproblems per location, loosely coordinated by a scalar value that prices resources. Recovering feasibility can be accomplished through solution of on-line re-optimization problems with little loss in performance, as shown in the experiments. Our results provide near-optimal algorithms that scale to larger problems with inhomogeneous prior information, which makes them well-suited to RH control approaches.
Chapter 6

Human-Robot Semi-Autonomous Systems

The final chapter in this dissertation is devoted to the subject of how to optimize interactions between humans and robots working as a team. Some of the issues to be discussed include: what tasks/roles are most appropriate for humans to do versus autonomous agents, how machines can adapt to human operators and vice-versa, how a semi-autonomous search and exploitation system can support time-varying levels of autonomy that adapt to mission conditions, and how much information humans can handle before succumbing to information overload.

After discussing these topics, we describe how models for human-machine interactions can be empirically validated via the gaming medium. Hence, we design a strategy game for the exploration of these issues. Our game design allows for various control structures to be tested and for examining the performance of human supervisory control when robots use the algorithms developed in the previous chapters for SM. This can help determine summary statistics that robots should report to humans. The purpose of this analysis is to determine which statistics/summary information can be used to maximize human operator situational awareness.

6.1 Optimizing Human-Robot Team Performance

The forthcoming discussion summarizes various issues and trade-offs that are involved in optimizing human and robot performance in a team environment.
6.1.1 Differences Between Human and Machine World Models

The goal of creating a semi-autonomous part-machine, part-human team with high performance is non-trivial because humans do not share a common paradigm for reasoning that is recognizable to machines and vice-versa. Machine-reasoning systems make decisions based on probabilities (belief-states), performance metrics, cost-sensitivity information, trade-offs between FAs and MDs, feature vectors generated from noisy sensors and other quantitative metrics. Human decision-makers choose actions based on intuitive threshold-levels, higher-level, amorphous contextual information, actual or perceived patterns in the environment, arbitrary priorities and preferences and from tendencies formed by habit. Humans tend to have a wealth of experience at their disposal that is generally valuable and very difficult to parameterize for machine use.

Some means of mapping back and forth between human and machine-reasoning frameworks for decision-making is necessary to allow machines to collaborate effectively with human supervisors/team-members. While mathematical models are useful for characterizing the most effective behaviors (actions) that robots can perform in various situations, it is difficult to quantify human actions in mathematical form. For this reason, this chapter proposes empirical methods to evaluate the best means of characterizing robot information-states for human consumption and for analyzing the sensitivity of human performance to various types of information reported by robots.

6.1.2 Human Decision-Making Response Time

A trade-off is necessary between the amount of time allotted to human operators/team-members for higher-level processing and the utility that such "soft-inputs" have in a real-time decision-making system that operates in an uncertain, dynamic environment. This is another instance of the search versus exploitation paradigm that applies to human reasoning processes. If humans take too long to reach a good decision, the world
    98 around them willhave evolved into a new state that is not necessarily related to the one for which a decision is to be made. If humans are forced to act too quickly, they may not be able to outperform machine systems, and therefore lose their utility in a semi-autonomous system. In order to close the loop and allow robots to collaborate with humans, algorithms used for machine-reasoning need to incorporate an awareness of the time-scale that humans require for choosing courses of action. Tasking humans with decision-making at a time-scale that is at or near the limits of their capability only detracts from human situational awareness and leaves humans prone to making poor decisions. Time-scales appropriate for human interaction need to be established as a function of current and projected near-term environmental complex- ity and need to include the fixed-cost of asking a human operator to switch contexts as well as the time-cost of analyzing a situation to reach a decision. The relative benefit of human supervision must be weighed against how well a machine can perform inde- pendently and an awareness of what else the human could have been doing with that time. This problem is similar to a job-scheduling problem with stochastic setup and job- completion times where the “servers” are human operators. (See [Leung, 2004,Sellers, 1996] for an overview of job-shop scheduling problems). 6.1.3 Human and Machine Strengths and Weaknesses Whereas humans are capable of significant parallel processing (e.g. w.r.t. being aware of and responding to numerous novel stimuli at the same time), they do so on a very slow time-scale relative to autonomous agents. However, automata are typically built to perform just a few select tasks very well and at high-speed; developing algorithms that have generalization capability remains a serious challenge. Humans can currently identify patterns and trends in complex environments that far exceed the ability of any automaton to process. On the other hand, humans are fallible, can become bored or
    99 confused and canact with predilection or supposition or short-sightedness. The larger the task that a human is given, the more contextual information there is to be aware of, the longer it takes human operators to switch between tasks [Cummings and Mitchell, 2008]. Therefore this is a design parameter, and tasks need to be divided up into appropriately sized pieces to maximize human operator effectiveness. One of the largest problems with incorporating input from human operators that oversee multiple robotic vehicles is that it is easy to confuse contexts in switching be- tween tasks and in so doing, to lose situational awareness [Cummings et al., 2005]. If human agents become confused, they are rendered idle until re-situated. The higher the complexity of the mission-space and the more dynamic the state-space, the more likely such loss of awareness is to happen. However, this issue can be quantified and anticipated, which allows for potential problems to be in large part averted. One means of staving off these problems is by providing human operators with key fea- tures/indicators when switching tasks such that information is presented to humans on a need-to-know/prioritized basis (with a certain amount of redundancy). There is a duality relationship between man and machine that can be exploited. Humans can help automata perform better by adding domain-specific knowledge that is not yet modeled (or feasible to model) in machine planning algorithms. Human par- ticipants in the decision-making loop can employ higher level reasoning processes (e.g. handling of contingency scenarios when a vehicle falls out of commission) to add robust- ness and human experience to the search and exploitation process. A semi-autonomous hybrid controller can take the input from both human and non-human controllers and fuse it together in such a way as to leverage the strongest characteristics of each of the component controllers.
6.1.4 Time-Varying Machine Autonomy

It is of practical interest to try and develop semi-autonomous object search and exploitation algorithms that can operate at dynamic levels of autonomy according to mission complexity [Bruemmer and Walton, 2003, Baker and Yanco, 2004, Schermerhorn and Scheutz, 2009, Goodrich et al., 2001]. The goal is to keep human workload at an acceptable level at the busiest of times while not forgoing the performance enhancements that humans can provide a semi-autonomous control system at less busy times. In routine conditions, a semi-autonomous system may be able to proceed on "autopilot" whereas in dangerous situations, or ones in which there is a lot of uncertainty/rapidly changing environmental conditions, it may be advantageous for humans to step in and assume more authority over UAV activities. It is obviously very important that any robot operating in the field be able to quantify its degree of uncertainty about the external environment and "understand" when its model for the world isn't matching reality.

6.1.5 Machine Awareness of Human Inputs

The amount of input that a human operator can give to numerous vehicles communicating on a one-on-one basis is limited, so the duration and objective of these interactions needs to be delimited by a protocol that is adaptive to circumstances [Cummings et al., 2005]. Allowing humans to participate in search and exploitation missions requires developing sophisticated Graphical User Interfaces (GUIs) that present a dynamic flow of information that is filtered/emphasized in real-time according to the content of the data being displayed. In order to best leverage humans' cognitive abilities for prediction, intuition and pattern-recognition, it is necessary to design software that takes account of the relative importance of the information being displayed and how long a human will require in order to grasp and make use of that information [Kaupp and Makarenko, 2008]. If human operators do not have sufficient time to consider and react to information that
    101 is displayed tothem, that information becomes noise that impairs the completion of other (feasible) tasks, and it should not be displayed [Cummings et al., 2005]. There is also a potential schism that must be handled concerning the activities a human may wish to focus on versus what a predictive decision-making algorithm considers to be the most important issue for human operators to work on. 6.2 Control Structures for HRI There are several degrees of freedom to be explored when it comes to the control struc- tures that are used to regulate, route, sequence and validate decisions made by machines. The number of robotic agents that a single human can oversee while acting as a supervi- sor is a central issue to explore [Cummings and Mitchell, 2008,Steinfeld et al., 2006]. In addition to using humans in a supervisory capacity, human operators can act as peers within a semi-autonomous team. Humans can also assume control of robots in time- varying fashion as a situation warrants. For instance 3 humans could oversee 9 robots (in 3 subgroups) and a 4th human could be tasked to assist one of the 3 subgroups in time-varying fashion as needed. Up to a certain level of dynamism in the environ- ment, this would allow all 3 subgroups to reap most of the benefits of having 2 human decision-makers involved at all times [Nehme and Cummings, 2007]. In addition to exploring the best choice for optimal team size and composition, the tasks that human/non-human operators perform best can vary over time. Tasks can be partitioned according to several different static and dynamic strategies [Gerkey and Matari´c, 2004]. As a first pass, a catalog of activities can be created, and then hu- mans can be given a fixed subset of the activities that they perform well and robots the complementary subset. Tasks can be partitioned between humans and machines based upon geographical constraints, which makes sense if humans are navigating vehicles and doing their own information-gathering. Tasks can preferentially be given to a human
    102 (machine) decision-maker androbots (humans) can be used as a fall-back option. Dif- ferent humans have heterogeneous capabilities and robots may as well. Task-allocations should take such heterogeneity into account [Nehme and Cummings, 2007]. Tasks can be partitioned based on a time-window threshold at which point humans will not have enough time to accomplish the task, so therefore a robot must take responsibility. Tasks can be allocated based on situational awareness metrics that machines use to gauge hu- man preparedness [Cummings et al., 2005]. Decisions must be made in a collaborative fashion, however, such that human and non-human decisions complement each other versus impede each other [Dudenhoeffer, 2001]. A mixed initiative control mechanism is necessary in order to incorporate both robot and human decisions into a single, unified plan of action. Human operators may be able to quickly generate a few candidate plans of action without being able to determine which one is best. If people can help “narrow the playing field” for autonomous agents, the agents can concentrate their processing power on looking more deeply into such a refined/limited set of strategies. In as much as possible, it is desirable to use a policy-driven decision-making scheme wherein a human makes a standing decision that remains in effect unless preempted by another human or a more important decision. “Management by exception” is required in order to support a system with one human controlling large numbers of robots [Cum- mings and Morales, 2005]. Threshold policies and quotas concerning resource usage and rate of usage are types of policies that both humans and non-humans can use to good effect. We explore how these decision-making techniques can be exploited to relieve the decision-making burden on human-operators but at the same time keep them as informed as possible about the evolving state of a mission.
6.2.1 Verification of Machine Decisions

Autonomous agents have yet to become fully accepted as team-members alongside humans in a collaborative environment. This issue has been observed in search and rescue settings as well as military ones. It takes time for people to gain confidence in new technology [Cummings et al., 2005]. In order to facilitate the acceptance process of autonomous agents, it is necessary to build machine-reasoning algorithms that not only arrive at verifiably good decisions, but ones that are transparent to a human operator. The issue of not over-trusting or under-trusting automata is a key factor in system performance [Freedy et al., 2007].

In short, we seek to consider the performance benefits of including one or more humans in the SM decision-making loop to guide the actions of semi-autonomous vehicles. We seek to identify tasks that robots do not perform well, that are computationally intractable for automata, or where machines would benefit from an external information source. Incorporating human input increases robustness to model error, allows the system to be capable of handling anomalies that are outside the scope of its design, and adds a level of redundancy that ensures the robots remain grounded and on track at all times. Conversely, having computerized feedback from robots can help the humans remain situated at all times as well.

6.3 Strategy Game Design

Having discussed some of the issues concerned with human supervisory control of teams of robots, this section describes how human operators may enhance the performance of our search and exploitation system. In preceding chapters we have developed algorithms that allow robots operating in dynamic environments of limited size to conduct near-optimal search and exploitation activities. In order to create an algorithm for SM over extended distances, we use a game that explores using a human supervisor to coordinate
    104 the activities ofmultiple teams of robots operating in distinct areas. Our game is a real-time strategy game with one human player and multiple UAVs. The objective of the game is for the human to partition tasks between several teams that are each composed of several UAVs. The game has a clock that counts down as time elapses. We divide the mission space into two or more zones of operation where the operator can task UAVs to (autonomously or semi-autonomously) search and exploit objects. In the autonomous mode of operation the robots are fully responsible for how sensing resources are expended in the search and exploitation process. In the semi- autonomous mode of operation, UAVs use guidance from the human operator before spending sensing resources (“management by consent”). If the human operator tells a UAV to move between zones of operation, it takes the UAV Td seconds to move to the new region, and during this time, the UAV is unable to perform sensing tasks. Td is a design parameter for our game whose value can be set to a large number to raise the level of commitment required to assign a UAV to a new region or lowered in order to decrease the significance of moving a UAV to a new region. While UAVs are in each zone of operation, they choose sensing tasks and expend their sensing resources semi-/autonomously in executing sensing operations to the best of their ability. The human operator oversees the consumption of resources by each UAV and how the total expected classification cost of the objects in each zone changes over time. This information can be used by the human operator to determine if it is time to direct a UAV to move to a new region of operation. The Graphical User Interface (GUI) for our game design is shown in Fig. 6·1. In this figure, an instance of the game is portrayed in which there are two regions for UAV sensing operations for the sake of simplicity. Each region consists of a grid of cells that are color-coded according to the current belief-state for each cell. Seeing as there are 3 primary colors, this game allows up to 3 types of objects to be represented.
    105 The game isshown in a state where one UAV is searching the left region, one UAV is sensing the right region and one UAV is moving from the right to the left region. The sensing resources held by each UAV are shown as a horizontal bar below each UAV’s icon. The hourglass symbol beside the UAV that is placed between the two grids indicates that this UAV is in transit and the operator must wait for that UAV to arrive in the region on the left. Cells with ambiguous information states are shown with gray or “muddy” colors, and cells that are well-classified have colors close to pure red, green or blue. The game displays several summary statistics below each grid/region/zone respec- tively. The “Cost” value and horizontal bar shows the current expected classification cost for the objects in the grid for each region. The “Bound” value and horizontal bar displays the lower bound on the classification cost for the objects in a grid that is re- turned from the column generation routine. The “Res” line displays the total sensing resources that are available across all of the UAVs in each grid. The last item, “Delta”, displays a new lower bound value that would result from giving the associated region another allotment of ∆R resources. Moving UAVs between regions allows sensing re- sources to be moved between each of the sensing zones in order to do load-balancing of sensing resources. We have designed this game with a server-client interface in mind. The client pro- gram has in-game buttons to play/pause, stop and restart the simulation. Assuming there are two sensing zones in the game, UAVs can be moved between zones by simply clicking on their icons (assuming they are not already switching between regions). If the game has more than two zones of operation, UAV icons can be dragged over a new region to tell them to move there. The game uses a login screen to allow a player to log on to the server that runs the back-end code (the sensor management system) before the game begins; the clients are essentially thin clients. The server supports one client at a
Figure 6·1: Graphical User Interface (GUI) concept for the semi-autonomous search and exploitation strategy game.
time, and communication is based on a TCP/IP protocol. This design allows multiple clients (multiple human players) to be involved in future game designs. After a game is finished, the server can collect statistics on the behavior of each player, and the client program keeps track of this information while a game is in progress. The server uses (continues to use) C++ and C code, whereas the client can be implemented using Qt, Java, DirectX or even in MATLAB. The server has one thread for each client for communication purposes and a dedicated thread for running computations (to create sensor plans via column generation). The clients have one thread for communicating with the server and one thread for handling user input (so that the process of communicating with the server is independent of the process of communicating with the player).

This game design can be used to test our hypotheses concerning human performance. There are multiple dimensions to be explored:

• statistics for situational awareness
• statistics for human decision-making response time
• statistics per operator for inclination to search versus exploit
• statistics per operator of how performance improves with gaming experience
• best ratio of UAVs per operator
• best ratio of the number of sensing zones per operator
• value of numerical versus graphical versus auditory feedback to guide the operator
• best arrangement of GUI elements to maximize software intuitiveness
• relative value of policy-based heuristics (fully autonomous) versus human-guided decision-making for moving platforms between regions
• operator overload as a function of:
  – environmental complexity and dynamics
  – simulation rate
  – number of sensing regions/zones
  – number of UAVs in the simulation
• human performance as a function of detail in simulation cost information:
  – expected cost per region, resources per region only
  – all of the above and projected cost per region using the lower bound solution
  – all of the above and projected cost sensitivity to additional resources

Drawing from all of these issues, we make the following hypotheses:

• human situational awareness is a function of:
  1. number of robots per team
  2. number of teams
  3. number of zones for sensing operations
  4. number of locations per zone
  5. granularity of resource allocations
  6. simulation rate
  7. rate of environmental dynamics
• an optimum exists for the best number of robots per team and number of teams per human operator
• human performance will degrade linearly with increasing simulation rate up until a certain threshold and non-linearly thereafter
• situational awareness can be improved by using per-zone summary statistics that describe the time-varying performance of each robotic team using visual and auditory clues, and operator overload can be mitigated by simultaneously using auditory and visual channels for delivering status information
• projections formed from our lower bound computations will increase the performance of human planners in a statistically significant way
• management by consent policies for sensor resource allocation will have the best performance up to a certain level of environmental complexity and simulation rate, at which point management by exception will be the better strategy
• human operators will not be able to effectively use per-location probabilistic information, whether it is represented as colors or in some other form, unless the game is trivially simple
• humans will require playing the game 10 or more times to become proficient
• operator boredom will contribute to UAVs being assigned to new regions more often than they should be
Our hypotheses concerning situational awareness can be tested using a fractional factorial design of experiments set of simulations. The control experiment for how well a human can interact with a team of robots can be handled by taking statistics on the performance of a human operator working with a single robot in a single sensing region. We can quantify how well the autonomous sensor planning algorithms perform by comparing their performance within a single region of operation with that of a human who is tasked to manually plan sensing operations for the same experimental setup. In both cases Monte Carlo runs with various operators can be used to average out experimental uncertainty concerning variability in human performance. After collecting data across a number of simulation runs with multiple players, we can statistically quantify the significance of each of these hypothetical performance factors. We can empirically determine the number of robots per human that maximizes performance at the same time. We can study the relative value of summary statistics for the state of the whole game and for each zone in the game by running simulations in which different operators have different statistics exposed to them and by watching how the performance of these operators differs. It is relatively straightforward to test the utility of providing predicted game/simulation cost information to human players.
    110 One question isjust how useful this information will be, and a second question is how this information can best be presented to an operator. A population of human operators can play the game with and without cost predictions per zone and with this information displayed in various ways to determine what the utility of the information is. Our intuition is that game players will perform better having cost-sensitivity infor- mation available while making resource allocation decisions. We think that playing the game using the lower bound on classification cost will also be an advantage to players versus playing the game with current expected classification cost statistics alone. We believe that per UAV resource information for a single-dimensional resource pool will be useful for operators, but that information on multi-dimensional resource pools will overload operators. This game design can be used as a test-bed for future work in the domain of semi- autonomous human and machine teams. In this chapter we have highlighted some of the key issues that are involved in the design of an effective hybrid control system and prescribed this computer game as a means of exploring the various trade-offs involved. In the first iteration, we envision a single-player game, but in successive versions, we anticipate no difficulty in incorporating input from multiple human players using the server-client paradigm. With the gaming medium, it should be possible to explore all of the issues concerning how humans interact with humans and humans interact with machines in such a hybrid, multi-player environment. After a multi-player version of this game has been implemented, it can be used to model realistic search and exploitation scenarios similar to those found in the field.
Chapter 7
Conclusion
Viewed from a high level, this dissertation seeks to address outstanding problems in the domain of optimal search theory with follow-up actions, the trade-off between search and exploitation, and human-computer relations/human factors. Within this context, we have presented algorithms that allow large, combinatorially complex problems to be broken up into subproblems of a tractable size that can be solved in real-time. We perform these hierarchical decompositions using Lagrangian Relaxation and Column Generation to coordinate the solutions of independently solved subproblems without losing the fidelity represented in subproblem solutions.
7.1 Summary of Contributions
In one of the most important contributions of this dissertation, Ch. 3 describes novel techniques for RH control algorithms based on mixed strategies and a lower bound for sensing performance that was developed in [Castañón, 2005a, Castañón, 2005b]. These strategies consider near-optimal (non-myopic, adaptive) allocation schemes for a set of noisy, multi-modal, heterogeneous sensors to detect and classify objects in an unknown environment in the face of resource constraints using a centralized control algorithm. We consider mixed strategies for sensing that employ a handful of possible sensor modes and a discrete set of measurement symbols with deep (far-seeing) decision-trees. A C++/C-language simulator was constructed to implement our RH control algorithms, and simulations using fractional factorial design of experiments were performed.
Differences in sensor geographical placement, sensor capabilities, sensor resource levels, planning horizon, and the relative cost of FAs and MDs were considered. We have demonstrated that, at least in the simulation scenarios considered, the use of a pure strategy for RH control that minimizes expected resource usage (subject to the constraint that sensing activities are performed) has near-optimal performance. We describe the extension of search functionality to an algorithm that was previously used solely for object classification in Section 2.2.2. The Search versus Exploitation trade-off is an important aspect of the SM problem that we address near-optimally for the SM problem formulation considered. Another contribution of this dissertation is presented in Ch. 4, in which two possible extensions to the problem formulation of Ch. 3 are developed. The first such extension describes the theoretical basis whereby SM can be conducted in a dynamic environment made up of a set of N locations with independent but time-varying states. These state dynamics are represented with HMMs, and a lemma is provided to show that even without a time-invariant state, it is possible to decouple subproblems by making use of time-varying Lagrange multipliers and an expanded Column Generation algorithm. In an alternative extension, we describe how problems with known but time-varying visibility can be modeled as well, by solving the resource allocation problem in terms of strategies that mix between resource use per sensor per time. Locations with known, time-varying visibility are germane to such applications as remote sensing with satellites following predictable trajectories. In Ch. 5, we consider an alternative formulation for SM in a detection context that uses a continuous action space and observation space (Gaussian Mixture Model) with a two-stage sensing horizon. DP and Finite-Grid techniques are used to optimally solve sensing subproblems and a Line-Search is used to find the optimal price of sensor resources. Using these techniques for problem decomposition, we near-optimally solve a
more general version of the problem that was posed by [Bashan et al., 2007] in roughly two orders of magnitude less computing time. First and foremost, we avoid performing N-dimensional grid-searches while looking for the optimal per-location sensing energy allocations. Lagrangian Relaxation and Duality theory are harnessed for this purpose. We make the argument that these decomposition methods as proposed by [Yost and Washburn, 2000, Castañón, 2005a, Castañón, 2005b] can be applied to a wide range of problems involving search, classification, (sensor) scheduling and assignment, of which Ch. 3 and Ch. 5 provide prototypical examples. The final contribution of this dissertation consists of the design of a game that explores issues surrounding the best means of allowing humans (operators) to input feedback into a semi-autonomous system that performs search+exploitation functions, with the ultimate goal of developing a near-optimal, mixed-initiative, human+robot search team that leverages the strengths of machine algorithms (model predictive control with scripted intelligence) and human intelligence (real-time feedback and adaptation). We propose the design of this game as a means of empirically measuring the most informative type of GUI interface for human operators that maximizes situational awareness and minimizes operator workload. Such a design allows more robots to be controlled per human operator.
7.2 Directions for Future Research
There are numerous directions for future work in the domain of SM. First of all, the algorithms we have proposed for RH control using time-varying Lagrange multipliers could be implemented. Column generation is known to have slow convergence properties for large problem instances, so the question of how these algorithms scale in the context of strategies that randomize sensor utilization over sensors and time (i.e. many more multipliers) is of interest.
    114 The implementation ofa game such as the one we have designed to explore the best framework for human supervisory control of robots is another important direction of inquiry. The ultimate goal being the creation of a mixed-initiative system wherein humans are maximally aware of what the automata are doing, robots are well-situated w.r.t. the situational awareness of their human operators, multiple human operators are able to communicate essential information between themselves, and all parties are continually tasked with activities that they are well-suited to perform. Robots that support adaptive autonomy levels are a topic of much interest recently. Ideally automata will be self-sufficient when humans are already over-loaded with other tasks but still designed to be able to incorporate more fine-grained human input when it is available. Robots that are managed by exception and that do not need explicit instructions for each and every task they perform are a long term goal in the Human-Robot Interaction domain and in the domain of human-assisted search and exploitation. A system for search and exploitation that jointly performs near-optimal SM and path-planning would be a direct though non-trivial extension of this research. We are interested in a problem paradigm that does not attribute value to moving a sensor to a lo- cation, but moving a sensor to be within sight of a location. After developing a tractable algorithm for SM with path-planning, a paradigm with risk of platform loss/malfunction can be considered. A near-optimal and adaptive algorithm for decentralized SM would be a significant contribution as well. Tractable algorithms for near-optimal SM with moving targets are interesting and difficult problems to work on. Higher resolution sensor models (more subproblem resolution [Jenkins, 2010]) and support for correlated sensor observations could be investigated. The models we have discussed in this work are appropriate for sensors that make observations with a narrow FOV (e.g. an electro-optical camera with a telephoto lens). Alternatively, research could be conducted for problems where sensors make observations over extended areas at the
same time, which introduces a data-association problem. Our algorithms have assumed there is no correlation between the states of various locations. This assumption was a requirement in order to make decomposition techniques possible. Additional research work is needed to create tractable algorithms that support correlation of object states across locations. PBVI techniques yield to solution via parallelization methods. Specialized computing hardware such as NVIDIA Tesla GPUs can be leveraged for this purpose to create real-time SM algorithms for problems of realistic size. Also, reinforcement learning and neurodynamic programming methods can be used to generate off-line value function approximations. In future work, approximate and general value functions could be computed off-line, stored and then used to seed solutions of online algorithms.
Appendix A
Background Theory
In this appendix we briefly overview some of the theory and concepts used in this dissertation. We first summarize the definition of a POMDP model and describe how the Witness Algorithm or PBVI can be used to solve POMDPs. Next, we discuss Dantzig-Wolfe Decompositions and Column Generation as a special case.
A.1 Partially Observable Markov Decision Processes
A Markov Decision Process (MDP) is a dynamic decision problem where the underlying state evolution is modeled by a Markov process, controlled by the decisions, and the state is perfectly observed and used as the basis for making adaptive decisions. There are a wide variety of uses for MDPs, some of which are discussed in [Bertsekas, 2007]. Two dissertations focusing on MDPs and their applications are [Patrascu, 2004, McMahan, 2006]. DP can be used to find optimal adaptive strategies to control a system whose state evolves in discrete time according to MDPs, assuming we know the parameters of the MDP model. An MDP model, despite its usefulness, is not sufficiently powerful in its descriptive ability to represent our SM problems. In our problems, we do not have access to the full state information but only to noisy measurements of the state. As a generalization of an MDP, a Partially Observable Markov Decision Process (POMDP) [Monahan, 1982] is an MDP in which only probabilistic information is available regarding the state of the world. This probabilistic information is summarized into a sufficient statistic called
a "belief" or "information-state". In a POMDP model, when an agent takes an action it also receives an observation that is probabilistically related to the true state of the world, and these observations can be used along with Bayesian inference to learn about the system under observation. The underlying information state of a POMDP at a particular stage is a belief-state, corresponding to the conditional probability of the world state (aka core state/underlying state) given all past observations. Formally, a POMDP is composed of the n-tuple $(X_t, U_t, O_t, \pi_1)$ along with the functions $T : X_t \times U_t \rightarrow X_{t+1}$, $Y_t : X_t \times U_t \rightarrow O_t$ and $R_t : X_t \times U_t \rightarrow \Re$, where these sets and functions are defined as follows:
• $X_t$ the set of possible discrete states at stage t
• $U_t$ the set of possible sensor actions at stage t (finite-dimensional)
• $O_t$ the set of possible observations at stage t (finite-dimensional)
• $\pi_1$ the initial belief-state
• $T$ the state transition function (Markov) with:
$$T(\pi_t, u_t, o_t) \equiv \pi_{t+1} = \frac{\mathrm{diag}\{P(o_t \mid x_t = k, u_t)\}\,\pi_t}{\mathbf{1}^T \mathrm{diag}\{P(o_t \mid x_t = k, u_t)\}\,\pi_t}$$
• $Y_t = P(o_t \mid x_t, u_t)$, the observation function that relates the sensor action to the environment being observed
• $R_t = r_t(x_t, u_t)$, the cost/reward function which gives the immediate cost/reward of a sensor action from a particular state
for a T-stage problem where $t = 1, \ldots, T$. In general, the objective of a POMDP problem is to select a policy $\gamma$ that minimizes/maximizes:
$$E^{\gamma}\Big[ R_T(x_T, u_T) + \sum_{t=1}^{T-1} R_t(x_t, u_t) \Big]$$
where the policy $\gamma_t : X_t \rightarrow U_t$ and $\gamma = \{\gamma_1, \ldots, \gamma_T\}$.
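The belief update $T$ defined above is simple to implement for the static-state case used in this dissertation's SM problems: the next belief is the current belief reweighted by the observation likelihoods and renormalized. The following minimal C++ sketch (not taken from the column_gen simulator; the names are illustrative only) makes this explicit.

#include <cstddef>
#include <vector>

// Y[o][x] = P(o | x, u) for the selected action u (observation likelihoods).
std::vector<double> beliefUpdate(const std::vector<double>& pi,
                                 const std::vector<std::vector<double>>& Y,
                                 std::size_t o) {
    std::vector<double> next(pi.size());
    double norm = 0.0;
    for (std::size_t x = 0; x < pi.size(); ++x) {
        next[x] = Y[o][x] * pi[x];     // diag{P(o | x, u)} * pi
        norm += next[x];               // 1^T diag{P(o | x, u)} * pi
    }
    for (double& p : next) p /= norm;  // renormalize; assumes P(o) > 0
    return next;
}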
Using Bellman's Principle of Optimality, a DP cost-to-go function can be written as:
$$V^*(\pi_t, t) = \min_{u_t \in U_t} \Big[ \langle R_t(u_t), \pi_t \rangle + \sum_{o_t \in O_t} V^*\big(T(\pi_t, u_t, o_t), t+1\big)\, P(o_t \mid I_t, u_t) \Big]$$
where the quantities $I_t$ represent the information history (the set of previous actions and observations) up until stage t and $\pi_t$ is the belief-state at time t (a sufficient statistic for $I_t$). The inner product $\langle R_t(u_t), \pi_t \rangle$ represents the expected immediate cost/reward for being in belief-state $\pi_t$ and selecting action $u_t$. $P(o_t \mid I_t, u_t)$ is given by:
$$P(o_t \mid I_t, u_t) \equiv P(o_t \mid \pi_t, u_t) = \sum_{x' \in X_t} Y(o_t \mid x', u_t)\, \pi_t(x')$$
A solution to a POMDP problem has two components. First, a value function is constructed which gives the optimal reward/cost as a function of the belief-state. This value function can be used to compute a policy (aka decision-tree) that gives the optimal course of action for the associated belief-state at that stage. For finite-horizon problems, there is a (generally distinct) value function associated with every decision-making stage and the optimal policy is time-varying. For infinite-horizon problems, there is just one value function and the policy is stationary (after convergence to the optimal policy). This makes it much more complicated to solve finite-horizon POMDPs than infinite-horizon POMDPs. See Fig. A·1 for an example of an optimal value function. In this example there are two possible states for each location: X = {empty, occupied}. The hypothesis H2 corresponds to the decision that an object is present at location i and H1 indicates location i is empty: Pr(xi) = Pr(H2) = Pr(occupied) and Pr(H1) = 1.0 − Pr(H2). There is one sensor with the generic mode "Measure". The nodes in this figure have a one-to-one relationship with the nodes in Fig. A·2. In this example, the cost of the measurement action was 0.2 units and the cost of a classification error is 1 unit. The optimal value function as given by the DP Value Iteration method (or approximately given by another method) is the concave hull of these hyperplanes.
Figure A·1: Hyperplanes representing the optimal Value Function (cost framework) for the canonical Wald Problem [Wald, 1945] with horizon 3 (2 sensing opportunities and a declaration) for the equal missed detection and false alarm cost case: FA=MD.
The classification costs (dependent on P(H2)) give the hyperplanes their slope. The measurement cost raises the level of the hyperplanes but does not change their slope. The optimal value function can be written using hyperplanes called α-vectors as a basis [Smallwood and Sondik, 1973].
Figure A·2: Decision-tree for the Wald Problem. This figure goes with Fig. A·1.
The concave (convex) hull of the vectors for a cost (reward) function gives the optimal cost-to-go value for a particular belief-state (probability vector) $\pi$:
$$V_t(\pi) = \min_{\alpha \in V_t} \sum_{x \in X} \alpha(x)\,\pi(x) \qquad \text{(A.1)}$$
Using this set of hyperplanes, the value function backup operation $V = HV'$ can be performed in four steps as follows:
1. First the intermediate sets $\Gamma^{u,*}$ and $\Gamma^{u,o}$ are required $\forall\, u \in U$ and $\forall\, o \in O$:
$$\Gamma^{u,*} \leftarrow \alpha^{u,*}(x) = R(x,u) \qquad \text{(A.2)}$$
where $R(x,u)$ is the reward (or cost) received for executing action u in state x.
2. $$\Gamma^{u,o} \leftarrow \alpha_i^{u,o}(x) = \beta \sum_{x' \in X} T(x,u,x')\, Y(o,x',u)\, \alpha_i'(x'), \quad \forall\, \alpha_i' \in V' \qquad \text{(A.3)}$$
where $T(x,u,x')$ is the transition probability function from state x to state x' and $Y(o,x',u)$ is the likelihood of observation o given state x' and action u. The variable $\beta$ is a discount factor for infinite-horizon DP; for finite horizons it is set to 1.0.
3. The next step is to create the sets $\Gamma^u$ $\forall\, u \in U$. This represents the cross-sum over the observations and includes one alpha-vector $\alpha^{u,o}$ from each $\Gamma^{u,o_z}$ for $z \in \{1, \ldots, |O|\}$:
$$\Gamma^u = \Gamma^{u,*} \oplus \Gamma^{u,o_1} \oplus \Gamma^{u,o_2} \oplus \ldots \oplus \Gamma^{u,o_L} \qquad \text{(A.4)}$$
where $L = |O|$ and the symbol $\oplus$ represents a cross-sum operator.
4. The last step is to take the union of the sets $\Gamma^u$, which are known as "Q-factors":
$$V = \bigcup_{u \in U} \Gamma^u \qquad \text{(A.5)}$$
The value $\Gamma^u$ represents the optimal cost-to-go provided that the first action is action u,
i.e., it is a branch of the optimal decision-tree with one less stage to go. See [Pineau et al., 2003] or [Kaelbling et al., 1998] for more details about this formulation. If there are $|V'|$ α-vectors in the previous basis for the optimal value function, the projection step generates $|U||O||V'|$ projections. The cross-sum step then generates $|U||V'|^{|O|}$ combinations. Therefore, although in practice many of the vectors that are generated are dominated (and therefore pruned out of the solution set), it is theoretically possible to have $|U||V'|^{|O|}$ vectors in the value function for $V$, with order $|X|^2|U||V'|^{|O|}$ time-complexity. Within a single backup, the number of hyperplanes grows exponentially relative to the number supporting the concave (or convex) hull at the previous stage, and every one of these new hyperplanes becomes part of the problem (burden) just one stage later. Fig. A·3 and Fig. A·4 demonstrate how the complexity of the structure of the convex (concave) hull of the set of hyperplanes representing the optimal value function grows with increasing dimension of the state, and this is a relatively simple example with just 4 possible states. In general, after forming the projections, the hyperplanes that are dominated (out-performed) by other hyperplanes must be pruned out of the solution set, and testing every hyperplane against every other hyperplane (for instance, by solving an LP) is a time-consuming operation. Solving a single LP is of polynomial complexity, but solving an exponentially growing number of them is exponentially complex. Monahan provides a survey of POMDP applications and various solution techniques [Monahan, 1982]. Sondik's One-Pass Algorithm [Smallwood and Sondik, 1973] was the first exact algorithm proposed for solving POMDPs. Michael Littman's Witness Algorithm [Littman, 1994] is a more recent POMDP algorithm that has computational benefits over Sondik's One-Pass Algorithm. The Witness Algorithm maintains a set of hyperplanes to represent the optimal value function in a POMDP and then systematically generates and prunes the possible next-stage hyperplanes in performing the DP backup (backwards recursion) operation until an optimal value function is found.
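Because LP-based pruning is expensive, a much cheaper (though weaker) test can be applied first: a hyperplane can be discarded whenever some other hyperplane is pointwise at least as good at every corner of the belief simplex. The sketch below illustrates this pointwise test for the cost (min) formulation of Eq. A.1; it is illustrative only and is not the pruning code used in the simulator.

#include <cstddef>
#include <vector>

using Alpha = std::vector<double>;

// Returns true if 'a' is pointwise dominated in the cost (min) setting of
// Eq. A.1, i.e., some other hyperplane has cost no larger than a(x) at every
// state x, so 'a' can never define the lower envelope.
bool pointwiseDominated(const Alpha& a, const std::vector<Alpha>& others) {
    for (const Alpha& b : others) {
        if (&b == &a) continue;                       // never compare a vector with itself
        bool dominates = true;
        for (std::size_t x = 0; x < a.size(); ++x) {
            if (b[x] > a[x]) { dominates = false; break; }
        }
        if (dominates) return true;
    }
    return false;
}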
Figure A·3: Example of 3D hyperplanes for a value function (using a reward formulation for visual clarity) for X = {'military', 'truck', 'car', 'empty'}, S = 1, M = 3 for a horizon 3 problem. The cost coefficients for the non-military vehicles were added together to create the 3D plot. This figure and Fig. A·4 are a mixed-strategy pair.
Figure A·4: Example of 3D hyperplanes representing the optimal value function returned by Value Iteration. The optimal value is the convex hull of these hyperplanes. This figure and Fig. A·3 are a mixed-strategy pair (see Section 2.3).
The optimal value function provides (expected) cost-to-go information for every possible belief-state and thus provides all the information necessary to select an optimal action given a particular belief-state.
A.2 Point-Based Value Iteration
Pineau recently developed a POMDP algorithm called Point-Based Value Iteration or PBVI that uses sampling to generate near-optimal policies [Pineau et al., 2003]. PBVI samples belief-space and maintains a record of the hyperplane with the best value (i.e., the best-available action) for every belief-point (sample point). The difference between Finite Grid Methods and PBVI is that the former only keeps track of the best value at a belief-point whereas the latter keeps track of the best hyperplane, which is enough to be able to reconstruct an approximation of the optimal value function in the neighborhood of the belief-point. When the belief-space can be sampled densely, Finite Grid Methods and PBVI give very good (or perfect) solutions; however, this is intractable in high-dimensional belief-spaces. In [Lovejoy, 1991a, Lovejoy, 1991b] a Finite Grid Method is proposed in which a Freudenthal triangulation is used to tessellate the state-space of the underlying MDP, which gives $(M+n-1)!/(M!(n-1)!)$ possible belief-points, where n is the number of MDP states and M is the number of samples in each dimension. In general, the PBVI technique scales much better complexity-wise than the other techniques for solving POMDPs and can be used to solve much larger problems near-optimally. Assume that the set B is a finite set of belief-points for PBVI. There will be one (optimal) α-vector computed for every belief-point in B. Using the PBVI algorithm, the approximate value function backup operation $V = \tilde{H}V'$ can be performed with the following steps:
1. $$\Gamma^{u,*} \leftarrow \alpha^{u,*}(x) = R(x,u) \qquad \text{(A.6)}$$
2. $$\Gamma^{u,o} \leftarrow \alpha_i^{u,o}(x) = \gamma \sum_{x' \in X} T(x,u,x')\, Y(o,x',u)\, \alpha_i'(x'), \quad \forall\, \alpha_i' \in V' \qquad \text{(A.7)}$$
3. Using the finite set of belief-points B, the cross-sum step of Eq. A.4 is much simpler:
$$\Gamma_b^u = \Gamma^{u,*} + \sum_{o \in O} \arg\max_{\alpha \in \Gamma^{u,o}} (\alpha \cdot b), \quad \forall\, b \in B, \ \forall\, u \in U \qquad \text{(A.8)}$$
4. The last step is to find the best action for each belief-point:
$$V \leftarrow \arg\max_{\Gamma_b^u,\ \forall\, u \in U} (\Gamma_b^u \cdot b), \quad \forall\, b \in B \qquad \text{(A.9)}$$
As is the case with the exact value backup in Eq. A.2 - Eq. A.5, the PBVI routine creates $|U||O||V'|$ projections. However, the support for the value function V is limited to at most $|B|$ hyperplanes, with a computational time complexity on the order of $|X||U||V'||O||B|$. Even more importantly, the number of hyperplanes does not "blow up" from one stage to the next as it does with the exact backup operation. Pineau gives more details on this derivation in [Pineau et al., 2003].
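The point-based backup of Eqs. A.8 and A.9 amounts to a pair of nested maximizations per belief-point. The sketch below assumes the projection sets of Eq. A.7 have already been formed, uses a reward (max) formulation, and is illustrative only; the container layout and names do not come from the pomdp-solve-5.3 code base.

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// B: belief-points; R[u]: alpha^{u,*} (immediate rewards for action u);
// Gamma[u][o]: projected alpha-vectors of Eq. A.7 for action u, observation o.
std::vector<Vec> pbviBackup(const std::vector<Vec>& B,
                            const std::vector<Vec>& R,
                            const std::vector<std::vector<std::vector<Vec>>>& Gamma) {
    std::vector<Vec> Vnew;
    for (const Vec& b : B) {
        Vec best;
        double bestVal = -1e300;
        for (std::size_t u = 0; u < R.size(); ++u) {
            Vec alpha = R[u];                            // Eq. A.8: start from alpha^{u,*}
            for (const auto& Guo : Gamma[u]) {           // add argmax over Gamma^{u,o} for each o
                const Vec* argmax = nullptr;
                double mx = -1e300;
                for (const Vec& g : Guo) {
                    double v = dot(g, b);
                    if (v > mx) { mx = v; argmax = &g; }
                }
                if (argmax)
                    for (std::size_t x = 0; x < alpha.size(); ++x) alpha[x] += (*argmax)[x];
            }
            double v = dot(alpha, b);                    // Eq. A.9: keep the best action's vector
            if (v > bestVal) { bestVal = v; best = alpha; }
        }
        Vnew.push_back(best);
    }
    return Vnew;
}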
A.3 Dantzig-Wolfe Decomposition and Column Generation for LPs
In our work, we use decomposition techniques to break multi-location problems into single-location problems, coordinated by a master problem. This approach is known as a Dantzig-Wolfe decomposition. Consider the following LP from [Bertsimas and Tsitsiklis, 1997]:
$$\begin{aligned}
\min\ & c_1^T x_1 + c_2^T x_2 \qquad \text{(A.10)} \\
\text{subject to}\ & D_1 x_1 + D_2 x_2 = b_0 \\
& F_1 x_1 = b_1 \\
& F_2 x_2 = b_2
\end{aligned}$$
where $x_1 \geq 0$ and $x_2 \geq 0$ and the $F_i$ are linear constraints that specify a polyhedral set of feasible points. The latter two constraints in Eq. A.10 are not coupling constraints, but the first constraint (with the $D_i$ matrices) couples the optimal values of $x_1$ and $x_2$ together. Define $P_i$ as the polyhedron describing the set of all $x_i$ such that $F_i x_i = b_i$ for $i \in \{1,2\}$. We can rewrite Eq. A.10 as:
$$\begin{aligned}
\min\ & c_1^T x_1 + c_2^T x_2 \qquad \text{(A.11)} \\
\text{subject to}\ & D_1 x_1 + D_2 x_2 = b_0
\end{aligned}$$
with $x_1 \in P_1$ and $x_2 \in P_2$. Now using the Resolution Theorem for Convex Polyhedra, the variables $x_1$ and $x_2$ can be written in terms of a basis of extreme points and extreme rays. Assume there are $J_i$ extreme points and $K_i$ extreme rays in the $i$th polyhedron. Let the vectors $x_i^j$ for $j \in J_i$ represent the extreme points of the polyhedron $P_i$. The vectors $w_i^k$ represent the extreme rays of the polyhedron $P_i$ for $k \in K_i$. Obviously, for bounded polyhedra the number of extreme rays is 0. The variables $x_i$ can now be written in the form:
$$x_i = \sum_{j \in J_i} \lambda_i^j x_i^j + \sum_{k \in K_i} \theta_i^k w_i^k$$
with the bounds $\lambda_i^j \geq 0\ \forall\, i,j$ and $\theta_i^k \geq 0\ \forall\, i,k$ and with a simplex constraint on the $\lambda_i^j$ values (we only want to allow convex combinations of the extreme points):
$$\sum_{j \in J_i} \lambda_i^j = 1 \quad \forall\, i \in \{1,2\}$$
This substitution results in the constraints:
$$\sum_{j \in J_1} \lambda_1^j \begin{bmatrix} D_1 x_1^j \\ 1 \\ 0 \end{bmatrix} + \sum_{j \in J_2} \lambda_2^j \begin{bmatrix} D_2 x_2^j \\ 0 \\ 1 \end{bmatrix} + \sum_{k \in K_1} \theta_1^k \begin{bmatrix} D_1 w_1^k \\ 0 \\ 0 \end{bmatrix} + \sum_{k \in K_2} \theta_2^k \begin{bmatrix} D_2 w_2^k \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} b_0 \\ 1 \\ 1 \end{bmatrix} \qquad \text{(A.12)}$$
where the optimization variables are now the $\lambda_i^j$ and $\theta_i^k$. For a general LP of the form $\min c^T x$ subject to $Ax = b$, the reduced cost of variable $x_i$ is written as $c_i - p^T A_i$, where $p$ is a Lagrange multiplier vector (dual variable). Therefore, relative to this new problem structure, the reduced costs for $\lambda_i^j$ can be written as:
$$c_i^T x_i^j - \begin{bmatrix} q^T & r_1 & r_2 \end{bmatrix} \begin{bmatrix} D_i x_i^j \\ 1 \\ 0 \end{bmatrix} = \big(c_i^T - q^T D_i\big)\, x_i^j - r_i \qquad \text{(A.13)}$$
and the reduced costs for $\theta_i^k$ are:
$$c_i^T w_i^k - \begin{bmatrix} q^T & r_1 & r_2 \end{bmatrix} \begin{bmatrix} D_i w_i^k \\ 0 \\ 0 \end{bmatrix} = \big(c_i^T - q^T D_i\big)\, w_i^k \qquad \text{(A.14)}$$
where the vector $p^T = [\, q^T \ r_1 \ r_2 \,]$ represents an augmented price vector (Lagrange multiplier) that gives the price of violating constraints in this primal problem. The Revised Simplex Method naturally supplies this type of pricing information as part of its solution procedure, so no extra calculation is necessary. Now, rather than trying to enumerate all of the reduced costs for the possibly very large number of $\lambda_i^j$ and $\theta_i^k$ variables, the best reduced cost (corresponding to whichever non-basic variable would be most valuable to have in the basis) can be found by solving
the auxiliary (related) LPs:
$$\begin{aligned}
\min\ & \big(c_i^T - q^T D_i\big)\, x_i \qquad \text{(A.15)} \\
\text{subject to}\ & x_i \in P_i
\end{aligned}$$
for each of the subproblems. Using a Column Generation procedure, if the LP solution for subproblem i has an optimal cost that is smaller than $r_i$ and finite, then we have identified an extreme point $x_i^j$ which implies that the reduced cost of $\lambda_i^j$ is negative. Therefore a new column $[\, D_i x_i^j \ \ 1 \ \ 0 \,]^T$ for the variable $\lambda_i^j$ is generated and added to the master problem. If the LP solution for subproblem i is unbounded, we have identified an extreme ray $w_i^k$ which implies that the reduced cost of $\theta_i^k$ is negative. Therefore a new column $[\, D_i w_i^k \ \ 0 \ \ 0 \,]^T$ for the variable $\theta_i^k$ is generated and added to the master problem. If the optimal cost corresponding to the $i$th LP is no smaller than $r_i$ $\forall\, i$, then an optimal solution to the original (master) problem has been found. Clearly, there is nothing limiting this formulation to just two subproblems; the only limit to the number of subproblems is the available amount of computing time. The application of a Dantzig-Wolfe Decomposition to a linear programming problem results in the method of Column Generation. This technique can be used to solve large systems of linear equations containing so many variables and constraints that they do not fit inside computer memory (or in some cases cannot even be enumerated). Consider a system of equations:
$$\begin{aligned}
\min\ & c^T x \qquad \text{(A.16)} \\
\text{subject to}\ & Ax = b \\
& x \geq 0
\end{aligned}$$
where $x \in \Re^n$, $c \in \Re^n$, $b \in \Re^m$ and the matrix A is $m \times n$ with entries $a_{ij} \in \Re$. Let $A_i$
denote the $i$th column of A. In situations where the matrix A is so large that it may not be feasible to evaluate the product $Ax$, it is still possible to build up the optimal solution of $c^T x$ iteratively. Let us assume that $m \ll n$. An iterative solution may be constructed by using the Revised Simplex Method, which just keeps track of the columns $A_i$ of A that are basic (that have support) along with the corresponding values $x_i$. The basis is initialized either by using m artificial variables (with large cost coefficients so that they will be driven out of the basis) or with another heuristic method. With every iteration of this procedure, one column is added to the basis (hence the name "Column Generation"), so the basis keeps growing in dimension. Let $I_k$ represent an index set of column indices of A in the basis up to (but not including) the $k$th iteration of Column Generation. On iteration k, Column Generation solves a Restricted Master Problem:
$$\begin{aligned}
\min\ & c_{I_k}^T x_{I_k} \qquad \text{(A.17)} \\
\text{subject to}\ & A_{I_k} x_{I_k} = b \\
& x_{I_k} \geq 0
\end{aligned}$$
over the basic variables $x_i\ \forall\, i \in I_k$ and determines whether or not there are any negative reduced costs associated with columns not yet in the basis. (A negative reduced cost indicates the solution is not yet optimal.) Each time a negative reduced cost is found, k is incremented by 1 and a new basic variable (for the column that had the negative reduced cost) is added to the set $I_k$. If no negative reduced costs are found, an optimal solution to the original LP Eq. A.16 has been found. Therefore the Column Generation method iteratively builds up a solution in a larger and larger subspace of the original space $\Re^n$ until an optimal solution is found, and this can be done even when $n = \infty$! It is, of course, necessary that the subproblems can be solved in a reasonable amount of time, or else iteratively solving subproblems in this fashion is not helpful. In situations where there are many constraints and a tractable number of
variables, the Cutting Plane Method can be applied to the dual problem. See [Bertsimas and Tsitsiklis, 1997] for the full derivation of this material. Williams demonstrates how Column Generation can be applied to an SM problem in his dissertation [Williams, 2007]. The dissertation [Tebboth, 2001] describes Dantzig-Wolfe Decompositions in more detail, including how columns can be generated in parallel to speed up the solution process.
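To make the iteration around Eq. A.17 concrete, the skeleton below shows the overall control flow of a column generation solver. It is only a sketch under the assumption that a restricted-master LP solver and a pricing subproblem solver are available; solveRestrictedMaster() and priceOutNewColumn() are hypothetical placeholders (in the simulator described in Appendix B these roles are played, roughly, by lp_solve and the per-location subproblem solutions).

#include <vector>

struct Column { std::vector<double> entries; double cost; };             // one generated column
struct MasterSolution { std::vector<double> duals; double objective; };

// Hypothetical placeholder: solve the Restricted Master Problem (Eq. A.17)
// over the columns generated so far and return its duals and objective.
MasterSolution solveRestrictedMaster(const std::vector<Column>& cols) {
    (void)cols;
    return MasterSolution{};
}

// Hypothetical placeholder: solve the pricing subproblems (Eq. A.15) using the
// duals; return true and fill 'newCol' iff a negative reduced cost is found.
bool priceOutNewColumn(const MasterSolution& master, Column* newCol) {
    (void)master; (void)newCol;
    return false;   // stub: a real solver would call the subproblem solver here
}

double columnGeneration(std::vector<Column> cols) {
    while (true) {
        MasterSolution master = solveRestrictedMaster(cols);
        Column newCol;
        if (!priceOutNewColumn(master, &newCol))
            return master.objective;   // no negative reduced cost: optimal
        cols.push_back(newCol);        // add the new column and re-solve the master
    }
}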
Appendix B
Documentation for the column_gen Simulator
The purpose of this appendix is to give a high-level overview of the function of the column_gen simulator, such that a third party could pick up the program and use it for their own simulations and/or continue to develop the code-base. The style of this appendix is less formal than the rest of the dissertation. This program was built off of the baseline established by Anthony R. (Tony) Cassandra while he did his dissertation work at Brown University. His program pomdp-solve-5.3 provides a lot of the core functionality of this simulator [Cassandra, 1999]. Nonetheless, I spent several years working with his code base, developing and customizing it, and in general had to make it work for me. There were something like 60–80K lines of code in Tony's program before I got to it, and I added around 10–15K; a significant portion of the work went to understanding and rearranging what was already there. Whereas Tony's interests were typically in the framework of a batch-mode execution of a POMDP solver using an infinite-horizon problem formulation, the concerns of this dissertation necessitated a program that could execute many POMDPs in a loop (unimpeded by file I/O). In addition, these POMDPs needed to have customizable parameters from one iteration to the next for such things as variable sensor resource costs. Later on, after beginning to work with "visibility groups", the POMDP Subproblems also needed to be able to support having a separate action-space for each Subproblem according to the particular set of sensors that had visibility for this Subproblem. Breaking Tony's batch-mode program into a more modular form able to run in a loop, learning what needed to be reinitialized or not, learning how
to create my own data structures following his conventions, etc., took significant (3+ person-months of) effort (before starting on the Column Generation algorithm). There were actually a couple of different bugs that came up while using a finite-horizon POMDP model that had to be fixed (and of course they were not relevant to the infinite-horizon problem and therefore were able to escape his notice). The simulator portion of the program is C++ code, while the planning code and the interface with pomdp-solve-5.3 are all written in C code; 'extern "C" {}' statements are used to allow this to work. Unfortunately, I found that Tony's POMDP solver did not work correctly while solving models using a cost formulation. Therefore I solved all POMDPs in this program using a reward formulation and had to convert at the interface of my code with his.
B.1 Build Environment
In its current form, the column_gen program has 3 different use-cases that I have been switching between by using a couple of "#if 1 (or 0)" type preprocessor directives in the main.cpp file. This is rather primitive, but it was supposed to be a temporary solution while evaluating what the final use-cases for the program will be. The "#if 1 (or 0)" preprocessor flag in main.cpp, line 145, controls whether the C++ simulator will be executed or whether the ColumnGenDriver() C-language program/subroutine will be executed. The driver basically just runs the Column Generation planning algorithm for a series of different inputs and is useful for creating graphs of the lower bound as a function of various input values. The "#if 1 (or 0)" flag in main.cpp, line 161, controls whether the full range of simulations will be run for all the different combinations of resource levels, horizons, MD-to-FA ratios and simulation modes (the default case, which currently entails $3^4 \times 100$ simulation runs) or whether just one batch of 100 simulations will be run. The latter case I have used (in conjunction with the appropriate values of seeds for the random number generator (see below) and appropriate values for each
of these design parameters) to jump-start the simulator in a particular state wherein it was crashing, so I could debug the problem.
The GNU Integrated Development Environment (IDE) "KDevelop ver 3.5.3" using "KDE ver 3.5.10" was used to develop this software. The parameters used to configure the project from within KDevelop->Project->Project Options are the following (for a debug configuration):
• Configure Options (General):
  – Configure arguments: --enable-debug=full (this param is generating a warning with Autoconf, needs attention)
  – Build directory: debug
  – Top source directory: (blank)
  – C/C++ preprocessor flags (CPPFLAGS): -D DEBUG
  – Linker flags (LDFLAGS): (blank)
• Configure Options (C):
  – C compiler: GNU C Compiler
  – Compiler command (CC): gcc
  – Compiler flags (CFLAGS): -O0 -g3 -L../lpsolve55
• Configure Options (C++):
  – C++ compiler: GNU C++ Compiler
  – Compiler command (CXX): g++
  – Compiler flags (CXXFLAGS): (blank)
• Run Options ("Main Program" check-box is checked):
  – Executable: column_gen/debug
  – Run Arguments: (as given in the previous paragraph)
  – Debug Arguments: (as given in the previous paragraph but without the redirection-to-file operator)
  – Working Directory: column_gen/debug/src
• Debug Options:
  – Debugger executable: /usr/bin/
  – Debugging shell: libtool
  – Options: Display static members, Display demangled names, Try setting breakpoints on library loading
  – Start Debugger With: Framestack
This project currently uses the Automake and Autoconf build tools, for better or worse. In order to add new files to the project, the Makefile.am files must be edited. These files get compiled into Makefile.in files, which eventually get turned into Makefiles. The "Makefile" files themselves are temporary/expendable in nature and should not be edited. There was a problem with the ltmain.sh script that is used to configure the project: it seems that it refers to the wrong version of the "libtool" script that is in a system folder, and in consequence I was originally having problems getting this project to compile on a computer running Linux (Ubuntu 9.04). After determining that the build errors had something to do with improper (inconsistent) versions of the different Autoconf and Automake tools getting run, I manually copied over the ltmain.sh script with one I had from the BU Stormy (Fedora Core 4) OS. This was an ugly fix, but it solved the problem. The whole build process needs to be reworked, and preferably moved away
from Autoconf and Automake. Or else someone more familiar with these tools could get in and get them working nicely together. I have not been using the Build->Install functionality from within KDevelop, just Build->Run Automake & Friends followed by Build->Build Project. Occasionally I have had problems with the KDevelop environment getting stuck in some kind of intermediate state with the Automake tools, and I have run "make distclean" from the command line (from the main project directory) to get rid of the old Makefiles and cached build information. (It does not cause any harm to delete the debug subdirectory altogether.) Then I have run the Automake tools again and rebuilt the project with the new Makefile. The Automake tools actually create Makefiles out of Makefiles! For the record, the main files that I created and frequently used while working on this project are the following:
• main.cpp
• simulator.cpp/.hpp
• vehicle.cpp/.hpp
• task.cpp/.hpp
• grid.cpp/.hpp
• cell.cpp/.hpp
• column_gen.c/.h
• pomdp_solve_5_3.c/.h
• global.h
• MyMacros.h
These files are found in the /src directory of the project. I also worked with the files:
• pg.c/.h
• pomdp.c/.h
• alpha.c/.h
• mdp.c/.h
• imm-reward.c/.h
some of which are in the same /src directory and some of which are in the /src/mdp subdirectory. Other files were modified only on rare occasions. By convention, C-language files were given the suffixes .c and .h and C++-language files were given the suffixes .cpp and .hpp. The following MATLAB scripts are helpful either for displaying simulation results or for debugging such things as the evolution of a belief on a decision-tree:
• DisplayPolicyGraph.m: uses MATLAB's biograph viewer to display a color-coded decision-tree
• ReadHyperplanesFromFile.m: helper file for DisplayPolicyGraph.m
• ReadPolicyGraphLine.m: helper file for DisplayPolicyGraph.m
• PlotLowerBoundPaperResults.m: plots the ROC curve of Fig. 2·4
• PlotValueFunction3D.m: plots the 3D value functions used in Appendix A.1
• PlotValueFunction.m: plots the 2D value functions such as in Fig. A·1
• BeliefEvolution_LowerBoundPaper.m: belief-evolution test-case relevant to the simulation results in [Castañón, 2005a]
• test_case_J_measure_J_total_mismatch.m: belief-evolution test-case of Fig. 2·6
• pg_tree_y1_y2_calcs.m: belief-evolution test-case related to Fig. A·1
B.2 Running column_gen
In this project the source files are all in the column_gen/src subdirectory and its subdirectories. There is a limited amount of documentation in the column_gen/docs subdirectory, and all of the data (as well as figures, some MATLAB scripts, etc.) are stored in the
column_gen/sensor_management subdirectory. The column_gen program can be executed with one of the following commands (the paths in these commands assume the program is launched from the column_gen/debug/src directory):
1. Search and Exploit Variation:
./column_gen -dom_check true -method grid -fg_type search -fg_purge domonly -proj_purge domonly -fg_epsilon 1.0e-9 -fg_points 1000 -scenario_filename ../../sensor_management/searchAndExploit_ver3.data -pomdp ../../sensor_management/searchAndExploit_ver3.POMDP > ../../sensor_management/simulatorOutputFileSearchAndExploit.txt
2. Lower Bound Paper Variation:
./column_gen -dom_check true -method grid -fg_type search -fg_purge domonly -proj_purge domonly -fg_epsilon 1.0e-9 -fg_points 1000 -scenario_filename ../../sensor_management/lwrBoundPaperColumnGen_ver3.data -pomdp ../../sensor_management/lwrBoundPaper_ver3.POMDP > ../../sensor_management/simulatorOutputFileLowerBoundPaper.txt
Most of these parameters are passed on to the underlying pomdp-solve-5.3 POMDP solver code. I added the command-line argument -scenario_filename, which allows me to specify my own file of simulation parameters for running the C++ simulator code. Most of the parameters to the POMDP solver control how it prunes hyperplanes and also specify the use of a Finite Grid (PBVI-type) algorithm versus one of the other 4 supported algorithms. To date I have only used the Finite Grid or Witness algorithm variations. The one interesting parameter of note here is -fg_points, which specifies the use of 1000 belief-points in this case. See Tony's documentation for more details about the use of these parameters.
B.3 Outputs of column_gen
When the simulator is running simulations (and not just stopping short with using the ColumnGen() function to generate sensor plans or lower bounds), as determined by the two preprocessor flags mentioned in Section B.1, there are two files that the column_gen program creates as output. The first file is a (generally verbose) log-file of output that is redirected to file according to the filename that is given as the last argument in the run command (the last section used the example filename "simulatorOutputFileLowerBoundPaper.txt"). The second filename is generated programmatically according to the current set of simulation parameters and is a comma-separated (csv) file that contains all of the per-simulation-batch statistics that are reported by the RunSimulationConfiguration() function. The verbosity of the log-file can be controlled by setting the TraceFlags global variable (more details follow in the next section). I typically import the csv output-file into a spreadsheet program, mark up the columns and analyze the performance in that context. From there, numerical information can also be exported to MATLAB (or MATLAB can be used to fscanf() the fields in from the csv output-file), and calculations or plots can be done.
B.4 Program Conventions
To begin the discussion, several different conventions were followed while working on this program that are worth mentioning. Following these conventions allowed for higher program clarity and lessened the opportunity for confusing one context of the program with another while working across multiple files. On occasion the C++ code has "call by reference" arguments, and fairly frequently the C code returns multiple arguments via the pointer mechanism. I tried to indicate this was happening by putting a comment at the end of a function call of the form
    139 e.g. “// =>obs”to indicate that obs was being returned by reference or by pointer. I made extensive use of dynamic memory allocation, but stuck with Tony’s version of the malloc calls: XMALLOC(), XFREE() and so on. When I had a dynamically allocated variable, that variable was passed using pointer notation (not array notation), however I attempted to indicate the sizes of all dynamically allocated structures using comments of the form e.g. “pArray/*[2][3]*/” to indicate that despite the fact that “pArray” may have been a pointer of type “double **”, it had been allocated to store a matrix with 2 rows and 3 columns. This type of notation helped me out significantly in not getting confused while changing contexts between what I was working on. I wrote this program assuming that one vehicle would have multiple sensors under its direction and that eventually the sensors would be constrained to moving together (and only looking at things within a certain range of the vehicle). This is still work in progress. Currently there are no constraints on what locations/objects a vehicle’s sensors can look at (other than the 0/1 type visibility constraints). Whereas I would have preferred to set a sensor-centric limit on the locations each sensor can look at (like in the CVehicle (aka sensor-platform) class), the ColumnGen() planning sub-routine was written well before knowing it would be used in this way. Therefore it was easier to store information in the TaskList[] array (an output of and also an input to the ColumnGen() function) to specify which sensors can look at which locations. The TaskList[] global array-variable has location-by-location information stored in it during the ColumnGen() function’s execution (more to follow). At this point in time, the CVehicle class in the simulator is just a container for all the sensors in the program, and no code concerning the positions of sensors (or locations of cells) is active/useful. The global array TaskList[] ought to be a local variable, an argument to the ColumnGen() function, and be passed down through the call-chain. This would fix several different issues, but is still work in progress. As is there is just one vehicle and there is no relationship/no constraints
    140 on the activitiesof the sensors which it holds. The only issue is that the simulation calls pVehicle->update() once per update cycle (were pVehicle is actually an Standard Template Library (STL) iterator), which causes one sensing task in the vehicle’s task- list (CVehicle::m taskList) to be undertaken. A more realistic account of time might entail updating each “vehicle” once for each sensor it contains, or else creating multiple vehicles that are each limited to containing one sensor. Again, this is work in progress. The vehicles maintain vector-resource information and vector-constraints on expected resource expenditure and handle multiple tasks for multiple sensors appropriately. For better or worse they process the tasks for multiple sensors in serial fashion however; if it mattered, this could easily be changed. Another reason that the TaskList[] variable is suboptimal is because it is very similar in name to the CVehicle::m taskList variable, and the two variables have no relation to each other. Additionally, if the Column Generation code was not dependent on the TaskList variable, then it would be possible to run multiple CSimulator objects in parallel (i.e. to support multiple concurrent simulations). I attempt to more or less exhaustively assert every condition I can at the beginning of function-calls, and elsewhere as well. These assertions have the form ‘Assert(x >0, “Something is broken, x <= 0”);’. The string in the second argument of the “Assert()” function is displayed when the assertion fails (and the program terminates at that point). While there were many false alarms that had to be dealt with in working with these assertions (getting them straight/self-consistent), they were nonetheless extremely useful in assuring a consistent+valid state of the program. In addition, this assertion function (at least under Linux) gives a file-name and line-number when it fails, which helps in the debug process quite a bit. In developing code, the sooner one can detect a problem after it occurs, the easier it is to handle and fix. So as not to slow the program down after it has been debugged, the “Assert()” function can be “#define’d” away to a null- function or else a pre-processor directive can be used to comment out the body of the
function. The one issue is that any assertion of the form 'Assert(0, "");' should not really be an assertion; it should be an "exit(-1);"-type statement. Actually, Tony wrote an Abort() function that provides a parameter for a textual description of the error condition; I should have been using that function instead of 'Assert(0, "");'. After the program is stable (and I believe it is at least fairly stable and well-debugged as is), the Assert() statements can be turned off, but the Abort() and exit()-like statements should remain. At that point it would make sense to start turning on compiler optimizations and tuning the algorithm for speed using performance profiling. I also #define'd a series of "VALID_[something]()" type macros in global.h that are used to test the validity of the range of some variable in a consistent fashion. Consistency is everything. These macros also reveal a lot of information about the conventions used in this program and so are a very good means of studying how the various variables are used and what types of values for e.g. states or observations or sensor indices etc. are allowable. As much as possible, whenever I changed Tony's code, I attempted to document those changes with comments of the form /* Darin's Modification [date] - description */. This worked well when I was making precise changes to his code, but was messier at the interface between what my code and his code were doing. I think at some point with the column_gen.c and pomdp_solve_5_3.c files, I quit bothering; those files are basically of my authorship now. There are numerous different types of "0" in programming, and I tried to disambiguate between them to make the code clearer. So first and foremost, I used "0" for ints (integers) and "0.0" for floats and doubles. And while "NULL" equates to a "0" for string and pointer comparisons, I used "NULL" anyway for clarity of exposition. The same goes for using "FALSE" and "TRUE" in C code (or "false" and "true" in C++ code) instead of merely 0 and 1.
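For reference, one common way to realize the Assert()/Abort() convention described earlier in this section (checks that can be compiled away in an optimized build, while hard failures remain) is a macro of the following form. This is not the project's actual implementation, only a sketch of the idea.

#include <cstdio>
#include <cstdlib>

#ifdef NDEBUG
  /* Optimized build: the checks are compiled away entirely. */
  #define Assert(cond, msg) ((void)0)
#else
  /* Debug build: report the file and line, then stop immediately. */
  #define Assert(cond, msg)                                              \
      do {                                                               \
          if (!(cond)) {                                                 \
              std::fprintf(stderr, "%s:%d: assertion failed: %s\n",      \
                           __FILE__, __LINE__, (msg));                   \
              std::abort();                                              \
          }                                                              \
      } while (0)
#endif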
    142 I have usedthe acronym “WARN”, as in “WARNING” in places where a piece of code is problematic, prone to failure or otherwise in need of attention. Test code is frequently commented in or out using “#if 1 (or 0)” preprocessor statements. Any code that was completely temporary (and not supposed to remain in the program) was generally set off in a block of text-comments (such as // . . . //) and labeled with a comment “remove on sight”. I frequently used Hungarian notation for variables, so “bReadyToReplan” is by con- vention a binary-type variable. A variable “nSizeX” would be of integer type (one of the variants) and “fLength” would be a floating-point value (type float). Member variables for classes were prefixed with “m ”, so “m bInit” might be used for a true-false mem- ber variable. In general I tried to use a convention such as “policyGraph” rather than “policy graph” for naming variables, but was not entirely consistent. This deserves re- mediation. I also wanted to choose a naming convention (and coding/indentation style) that was different from Tony’s to make it easier to delineate the boundaries between the code we have each written. The lp solve-5.5 program has the habit of sticking things in the 0th position in an array or matrix and then indexing that array or matrix with a 1-based notation (like MATLAB). One example of this is that the objective function coefficients are stored in row 0 of the constraint matrix and then the actual constraints start at row 1. Similarly, while requesting the solution outputs from lp solve-5.5, the primal variable values, dual variable values and I believe the objective function value are all lumped into the same array and have to be indexed appropriately. Originally it was necessary to specify a .POMDP file that defined a POMDP model and an .alpha file to specify the “terminal-values” that were used to initialize the cost function (for a finite-horizon POMDP). It took me a while to figure out that the “terminal-values” argument is actually for initializing the solvePomdp() function which
    143 is at thecore of Tony’s POMDP solver. The .POMDP file structure is quite general and flexible, but the .alpha file structure is much the opposite. Therefore, eventually I got away from using .alpha files and defined my own value-function initialization parameters (i.e. FA and MD hyperplanes) programmatically. Currently, a pair of input files is used to run a simulation. Concurrency must be maintained between the simulation .data file and .POMDP files. There are dependen- cies in several different ways (w.r.t. the dimensions of the state and action spaces). I am currently using “ver3” of these files. The .POMDP file is still used to define the POMDP model parameters (states, actions, observations and immediate rewards). The immediate rewards are actually over-written later on programmatically, but it’s still im- portant to specify dummy reward values in the .POMDP file or else storage will not be allocated, seeing as the immediate rewards are stored in a sparse representation (0-valued immediate rewards are not stored and the immediate reward of an action is inferred to be 0 if no reward is found for that action in the sparse representation). B.5 Global variables in column gen There are actually multiple data-types that Tony has defined for working with immediate rewards. He uses the global variable “gImmRewardList” which is a linked-list whose nodes can represent scalar values, vectors or matrices. See the functions in mdp/imm- reward.c: “updateRewards()” and “updateActionReward()” for more information. A global variable “gProblemType” is used in Tony’s code to set the solver’s behavior to solve either MDPs or POMDPs. The nodes in his gImmRewardList linked-list, of type Imm Reward List, can have the “type” of either ‘ir value’, ‘ir vector’ or ‘ir matrix’, where the type used in program execution will reflect how the immediate rewards were given in the .POMDP file. Currently I am actually programmatically generating actions (the set of feasible ac-
tions) based on the contents of the .POMDP file as well. I use the actions defined in the .POMDP file to specify a template for the actions a sensor can support. (This is a departure from Tony's paradigm.) The (primary) global variables gNumStates, gNumActions and gNumObservations are set as the .POMDP file is read in. This happens very early in the program, starting from the initPomdpSolve() function. (I broke the initialization code of pomdp-solve-5.3 into pieces, however.) After reading in the .POMDP file, my simulation data input file, of type .data, is read in and then, according to the instructions in my file, sensors (sensor actions) are instantiated based on the sensor-action templates in the .POMDP file. Therefore I modify the value of gNumActions on the fly. I read in and parse my simulation parameters in the function ReadSimulationData() in column_gen.c. (More to follow.) The following is a list of most of the significant global variables that were used within the original pomdp-solve-5.3 program (and are mostly still in use now):
• Matrix *pomdp_P/*[TotalSensorActions]*/: POMDP model transition prob.
• Matrix *pomdp_R/*[TotalSensorActions]*/: POMDP model observation prob.
• Matrix pomdp_Q: POMDP model immediate values for state-action pairs
• I_Matrix *IP: temporary matrix of transition prob. (used while reading the .POMDP file)
• I_Matrix *IR: temporary matrix of observation prob. (used while reading the .POMDP file)
• int gNumStates: number of states in the POMDP model
• int gNumActions: number of actions in the POMDP model (I rewrite this value)
• int gNumObservations: number of observations in the POMDP model
• int *gNumPossibleObservations/*[gNumActions]*/: mainly used to ensure that every action generates at least one observation
• int **gObservationPossible/*[gNumActions][gNumObservations]*/: controls which branches of the decision-tree are so improbable as to not be worth "walking" (i.e., including in expected cost calculations), and also determines the projections that are created in the POMDP backup operation
• Imm_Reward_List gImmRewardList: linked-list that stores immediate reward values
• Problem_Type gProblemType: should be 'POMDP_problem_type' for our case
• double gDiscount: should be 1.0 for a finite-horizon POMDP
• Value_Type gValueType: should be 'REWARD_value_type', the cost-based formulation
• double gMinimumImmediateReward: equal to the most costly measurement reward; jump-starts Value Iteration by establishing a lower bound on the costs
• double *gInitialBelief: not part of the active code-base; I specify prior probabilities in my own input (.data) file
Tony's code defined the sparse matrices P (transition probabilities), R (observation probabilities) and Q (immediate rewards), which was fine when just one POMDP was being solved, but once I started working with visibility groups this was no longer the case. Therefore I did a project-wide search-and-replace and renamed these variables pomdp_P[], pomdp_R[] and pomdp_Q, but, far more significantly, I introduced arguments to a large portion of his hundreds of functions (the entire code-base of 60-80K lines), so that the program is not forced to refer to pomdp_P[], pomdp_R[] and pomdp_Q. I introduce a structure that I call "sensorPayload" that contains these values, as well as a customized version of immRewardList, actionNames[], etc., which allows POMDPs with different numbers of actions (or different encodings for actions) to be solved by his code. I should have called "sensorPayload" "sensorConfiguration"; the name is a bit of a misnomer. At any rate, I define one such structure (that represents all the POMDP problem parameters) and pass it in to Tony's solvePomdp() function to compute a solution. This is the largest single change I made to his program, and it took about a month to do this, create the programmatically defined actions and debug the results. I also had to break certain portions of the code that was in solvePomdp() into pieces and move some code into per-solution-call setup and shutdown code that comes before or after the call to the solver, respectively. I also changed the initialization of the Finite Grid code so that the belief-point grid is generated once per program run instead of once per call to the POMDP solver.
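As a rough illustration of what this refactoring looks like, the following is a minimal sketch of a sensorPayload-style structure; the member names and types are only suggestive of the actual definition in column_gen (the real code uses Tony's sparse Matrix type and carries additional fields), and one such structure per visibility group is what gets handed to solvePomdp().

    /* Sketch only: the actual sensorPayload structure in column_gen differs. */
    struct SensorPayloadSketch {
        int        numActions;        // replaces the global gNumActions for this sensor subset
        double  ***P;                 // P[a][x][x']: per-action transition probabilities
        double  ***R;                 // R[a][x'][y]: per-action observation probabilities
        double   **Q;                 // Q[x][a]:     immediate values for state-action pairs
        char     **actionNames;       // names for this payload's own action encoding
        int       *sensorStartIndex;  // first action index of each sensor in this payload
    };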
One last significant change that I made to the pomdp-solve-5.3 code is that I updated it to use the lp_solve-5.5.0.13 solver instead of the far older and far inferior lp_solve-2.x solver. The newer version of the solver is in a whole separate class of program than the older. Nevertheless, this change only affects the Witness Algorithm code, which I stopped using because it was still too slow. We decided that because the Finite Grid method is already an approximate technique, solving LPs to prune hyperplanes is rather silly, so we use simpler pruning methods. For all the advances to lp_solve, I gather it still only runs about 5% as fast as CPLEX. Despite the fact that lp_solve-5.5.0.13 is not currently being used to solve POMDPs, it is still useful to have the LP solver in the program when it comes to iteratively solving the LPs used in Column Generation. I have made a major modification to the way pomdp_P[], pomdp_R[] and pomdp_Q are created from the temporary matrices IP[] and IR[]; see the function ReadSimulationData() for details. In terms of how the matrices pomdp_P[], pomdp_R[] and pomdp_Q are accessed in this program, take a look at the function showProblemMatrices(), which shows typical usage. This is a debug function that prints the values of each of these matrices to stdout. Tony has created numerous such debug functions that are very useful for printing debug output (hyperplanes, policy-graphs, etc.) to the console.
Here is a list of some of the more important global variables and arrays I have added to this program:
• int NumTargets: number of locations
• int GridSizeX, GridSizeY: in use, but nearly useless until vehicles (sensors) have locality constraints
• int NumTargetTypes: number of object types (including one for 'empty')
• int NumDecisions: number of decision/declaration hyperplanes read in from the .data input file
• int NumSensorTypes: number of templates for sensors; the .data and .POMDP file values must agree
• int NumTargetGroups: the number of groups of targets with distinct a priori probabilities
• int NumVisibilityGroups: determined by counting how many different classes of visibility the prior probabilities are divided into in the .data file
• unsigned int *VisibilityGroups/*[NumTargets]*/: allocated with the maximum possible size, but only NumVisibilityGroups elements are used
• Target *TaskList/*[MAX_NUM_TARGETS]*/: per-location solution data after ColumnGen() is executed; also specifies the inputs (prior probabilities and sensor visibilities) to the ColumnGen() function
• int TotalSensorCount: number of instantiations of sensors (which are programmatically generated from the .POMDP file)
• int DistinctSensorActions: the number of (parsimonious) actions specified in the .POMDP file
• int TotalSensorActions: the number of actions across all (programmatically generated) sensors
• int *ParsimSensorStartIndex/*[NumSensorTypes]*/: I want to be able to find the index of the jth mode of sensor i when all of the sensor modes are embedded together in one long list (stored in a sparse format) where there is a different number of possible modes (sensor actions) depending on which sensor is being used. Therefore I use the array ParsimSensorStartIndex[] to store the index of the first mode (j = 0) for each of the NumSensorTypes sensors. (This is very similar to how, for sparse matrices, the beginning row indices are stored in the Matrix structure as row_start[row_i].) I prepend "parsim" => parsimonious to indicate that each (sensor, mode) combination has a unique action index associated with it in this array. Therefore a vehicle that has multiple instances of the same type of sensor in its sensorPayload structure will have each of its equivalent sensors' modes mapped to the same set of action indices when referring to the array ParsimSensorStartIndex[]. I use this scheme so that I don't have to store action names and the relative costs of the sensor modes redundantly for each of the duplicate/cloned/equivalent sensors that a vehicle may have. Each of the sensorPayload structures contains a similar array called "sensorStartIndex[]", which has the starting indices for each of the sensors contained in that sensor payload. In general, each sensorPayload structure has a distinct/independent sparse encoding for its actions
• int *SensorStartIndex/*[TotalSensorCount]*/: sparse mapping/encoding giving, for each sensor, its first action in the joint list of actions (across all sensors)
• int *SensorTypes/*[TotalSensorCount]*/: type of each programmatically generated sensor, where the types are defined by the actions in the .POMDP file
• int *ActionToSensorMap/*[TotalSensorActions]*/: index of the (programmatically generated) sensor that performs a (programmatically generated) action
• int *NumActionsPerSensorType/*[NumSensorTypes]*/: number of actions that each sensor in the parsimonious list (from the .POMDP file) can do
• char **actionNames/*[TotalSensorActions+NumDecisions][40]*/: names for (programmatically defined) actions stored in a sparse format (i.e. all sensors confounded, with a variable number of "actionName" strings per sensor)
• char **stateNames/*[NumTargetTypes][40]*/: ragged array of strings for state names
• char **observationNames: ragged array of strings for observation names
• int *stateList/*[NumTargetTypes]*/: deprecated; used when I had a terminal capture state instead of 'wait' actions
• double *SensorTimeCost/*[TotalSensorActions]*/: the relative cost of each sensor mode, as specified in the .data file
• UINT32 TraceFlags: 32-bit unsigned integer that stores bitwise flags used for filtering program output
The arrays SensorStartIndex[], SensorTypes[], ActionToSensorMap[] and actionNames[] are all sparsely defined. The arrays NumActionsPerSensorType[] and ParsimSensorStartIndex[] are defined w.r.t. the number of classes of sensor that the actions in the .POMDP file are divided into. (The .POMDP file lists all the possible actions in one long list; my .data file is responsible for associating one action or another with a class of sensor, i.e. a sensor template. Once the number of actions per sensor template is established, it is possible to specify how many sensors of each type are desired within a simulation.) The input files searchAndExploit_ver3.POMDP and searchAndExploit_ver3.data are a pair, as are lwrBoundPaper_ver3.POMDP and lwrBoundPaper_ver3.data. These files have a lot of parameters between them that need to be set correctly in order for these sparse arrays to function correctly. There are some comments at the top of my *.data files that describe how those files are laid out, and the documentation Tony provides for his program remains in effect for the .POMDP files. First and foremost, great care needs to be taken to specify parameter lists w.r.t. the parsimonious action list where the parsimonious values are required, or w.r.t. the expanded/programmatically generated list of actions where that is required. (If there are 3 sensors that each support 'mode1' and this mode has identical statistics for all 3 sensors, then 'mode1' would appear once in a "parsimonious" list of actions but 3 times in the expanded/programmatically generated list of actions, which generates 3 actions for the POMDP solver based off of the prototype 'mode1' specified in the .POMDP file.)
In the .data files for my simulation parameters, I give a comment line above each row of parameter values that documents what that row is for and how many parameters should be there. My input files understand %'s at the beginning of lines and at the end as well. I specify resources and initial lambda values w.r.t. the parsimonious action list and then replicate these values as appropriate when sensors are instantiated from the templates defined in the .POMDP file; currently, if there are multiple sensors of the same type in a simulation, then they must share the same values for initial sensor resources and initial lambdas (used to create an initial basis for Column Generation). As a brief overview, imagine the .POMDP file has the actions 'wait_0', 'search_0', 'mode1_0', 'wait_1', 'mode1_1', 'mode2_1', which indicates that there are 2 classes of sensor, that the first sensor has a 'search' and a 'mode1' action, and that the second sensor has a 'mode1' and a 'mode2' action. This is the parsimonious action list. (I have also described it as a "Distinct" action list.) (The tags (suffixes) indicating which sensor each action belongs to are programmatically over-written when the actionNames[] array is created for the expanded set of actions.) The .data file might have a line that instantiates 2 of the first type of sensor and 1 of the latter: '3 0 0 1' (this is the 3rd non-comment line in the file). SensorTypes[] takes on the values of the last 3 parameters on this line; they specify that 3 sensors, of types 0, 0 and 1, will be used in the simulator. (All indices in the program are always 0-based, except for certain lp_solve-5.5 function calls.) This specification causes a list of actions to be defined (and the value of gNumActions to be modified) such that the (non-parsimonious) actions are: 'wait_0', 'search_0', 'mode1_0', 'wait_1', 'search_1', 'mode1_1', 'wait_2', 'mode1_2', 'mode2_2'. (This list corresponds to the first 9 entries in the actionNames[] array; the actionNames[] array stores the names of the decision (a.k.a. declaration or classification) hyperplanes after this list of 9 action names.) In this example, TotalSensorCount = 3, TotalSensorActions = 9, DistinctSensorActions = 6, SensorStartIndex[] = [0 3 6] and ActionToSensorMap[] = [0 0 0 1 1 1 2 2 2].
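A minimal sketch of how this expansion could be carried out follows; the actual code lives in ReadSimulationData(), and the helper below is hypothetical, reproducing only the bookkeeping described above.

    #include <string>
    #include <vector>

    // Hypothetical helper: expand the parsimonious (template) action list into one
    // action list per instantiated sensor and build the associated mapping arrays.
    struct ExpandedActions {
        std::vector<std::string> actionNames;       // e.g. "wait_2", "mode1_2", ...
        std::vector<int>         sensorStartIndex;  // first action index of each sensor
        std::vector<int>         actionToSensorMap; // sensor index owning each action
    };

    ExpandedActions expandActions(
        const std::vector<std::vector<std::string>>& templateActions, // per sensor type
        const std::vector<int>& sensorTypes)                          // type of each sensor
    {
        ExpandedActions out;
        for (size_t s = 0; s < sensorTypes.size(); ++s) {
            out.sensorStartIndex.push_back(static_cast<int>(out.actionNames.size()));
            for (const std::string& base : templateActions[sensorTypes[s]]) {
                out.actionNames.push_back(base + "_" + std::to_string(s));
                out.actionToSensorMap.push_back(static_cast<int>(s));
            }
        }
        return out;
    }

    // With templateActions = {{"wait","search","mode1"}, {"wait","mode1","mode2"}} and
    // sensorTypes = {0,0,1}, this yields sensorStartIndex = [0 3 6] and
    // actionToSensorMap = [0 0 0 1 1 1 2 2 2], matching the example above.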
The simulator uses 3 separate seeds for random number generation, each used for a different purpose. The use of multiple seeds allows consistency (reproducibility) to be maintained across simulation runs. The seed "gCellSeed" controls the states (the types) of each location (cell) in a simulation run. The seed "gMeasurementSeed" controls how the trajectory of random observations evolves. The seed "gStrategySeed" is mainly used for the methods of RH control that employ randomization (i.e. "randomly choose a pure strategy on a per-location basis with a distribution given by the measure weights"). By resetting the value of gCellSeed after every series of 100 Monte Carlo simulation runs (and after changing the resource levels, MD-to-FA ratios, horizon, etc.), it was possible to ensure that each batch of 100 simulations used the same trajectory of cell (location) states, while these states still evolved randomly from one simulation run to the next.
Two enumerated types are worthy of mention. The first is used for controlling the display of tracing information (i.e. logging of debug information to stdout or a file) in a programmatically controllable fashion (rather than having all program output be fixed). I created my own version of the printf() function, called myprintf(), which takes an extra parameter (a bitwise variable of boolean flags) that is passed in to each myprintf() call. The global variable TraceFlags implements the other half of this lock-and-key mechanism: when TraceFlags has matching bits with the trace-code given to a myprintf() call, output is generated in the log-file; see Section B.3. These trace-codes therefore act as filters and control how much, if any, debug information is stored. Unless a very specific aspect of the program is being tested, all of the lower-level trace-codes cause output to be generated so fast, and so verbosely, that the program cannot run while they are active; they are only of use for testing a single function call or chain of low-level function calls.
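The mechanism amounts to a bitmask test in front of vprintf(); a minimal sketch is given below (the real myprintf() also supports logging to a file and differs in its details).

    #include <cstdarg>
    #include <cstdio>

    typedef unsigned int UINT32;
    UINT32 TraceFlags = 0;   // global filter: which trace contexts are currently active

    // Sketch of the lock-and-key trace filter: print only if the caller's trace-code
    // shares at least one bit with the global TraceFlags.  A trace-code of 0
    // (AllTrace) is treated here as "always print".
    void myprintf(UINT32 traceCode, const char* fmt, ...)
    {
        if (traceCode != 0 && (traceCode & TraceFlags) == 0)
            return;
        va_list args;
        va_start(args, fmt);
        vprintf(fmt, args);
        va_end(args);
    }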
I attempted to design the trace contexts in such a fashion as to allow certain types of context-sensitive debug information to be displayed to the console (or to a file) to aid in the debugging process. I generally found this mechanism useful, although sometimes it is hard to partition one aspect of the program's functionality into just one type of tracing operation. To handle the need for a many-to-one mapping, I have myprintf()-type statements that output traced data if any one of a number of these trace-codes is active, using a bitwise OR. The values of enum Tracecode are:
• AllTrace=0: exception case; rather than trace nothing, trace everything
• SolverTrace=(1<<0): deprecated
• LikelihoodTrace=(1<<1): calculation of likelihoods used for branch probabilities while walking decision-trees
• InitHyperplaneTrace=(1<<2): used to determine which hyperplane/policy-graph node a cell/location should use
• PolicyGraphTrace=(1<<3): traces calculations while walking decision-trees
• CostTrace=(1<<4): traces cost information (classification + measurement) for individual Subproblems
• TestTrace=(1<<5): a dummy trace context to be used for whatever debugging a situation requires
• IntermedLPTrace=(1<<6): prints the Column Generation LP after it is created and after each column is added
• ComputeResUsageTrace=(1<<7): traces cost information (classification + measurement) across all Subproblems
• ColGenTrace=(1<<8): prints information on each column, such as Lagrange multipliers and objective function values, as columns are added
• StrategyTrace=(1<<9): used to display the strategy tree (all pure strategies) after Column Generation, and also expected cost information and initial nodes in the strategy tree for each Subproblem
• ColGenOutputTrace=(1<<10): the main trace flag for Column Generation besides StrategyTrace; displays numerical information about lambdas, resources and expected costs
• GridStateTrace=(1<<11): used in the simulator to show cell/location states as they are generated; useful for debugging purposes with a small grid
• NewTaskTrace=(1<<12): trace information concerning a sensing task that is to be added to the queue of tasks
• FinishTaskTrace=(1<<13): trace information concerning what happens after a task is completed (and potentially a follow-up task is created)
• TaskListTrace=(1<<14): trace information concerning sensing tasks in general
• VehicleUpdatesTrace=(1<<15): trace high-level information concerning vehicle (i.e. sensor) resources and outstanding tasks, queue sizes
• SimUpdatesTrace=(1<<16): trace information about the task that each vehicle does during each update cycle
• SimResultsTrace=(1<<17): trace the final results of a simulation
• CellErrorTrace=(1<<18): trace information concerning the ML-classification of each cell at the end of a simulation
• OutputTrace=(1<<19): the main trace flag for all simulator output
• PrintPGPointersTrace=(1<<20): was used for debugging a policy-graph (PG structure) memory leak
• NoTrace=(1<<21): a label for the extent of the trace-code flags; provides a NULL or out-of-range value for this enumerated type
The trace-codes can be modified dynamically throughout the course of the program; i.e. after detecting an error condition, additional trace-codes can be turned on and a conditional breakpoint can be set to start going through the nitty-gritty details of the program's state from that point on.
Another enumerated type is used to control the method with which the simulator uses mixed strategies for RH Control. I call this the "simulation mode" and use enum Simulator_Mode to specify its value:
• eChooseByMixtureWeight: choose the pure strategy with the largest mixture weight
• eChooseByProbability: choose a pure strategy randomly, to be used for all locations, based on a distribution governed by the mixture weights
• eChooseByProbabilityPerCell: choose as in eChooseByProbability, but on a location-by-location basis
• eChooseByClassifyCost: choose whichever pure strategy has the lowest classification cost
• eChooseByMeasureCost: choose whichever pure strategy uses the least resources
• eNumSimulatorModes: a label for the extent of the simulation-mode enums; provides a NULL or out-of-range value for this enumerated type
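For the randomized modes, selection amounts to sampling a pure-strategy index from the discrete distribution defined by the mixture weights; the following is a minimal sketch of what a chooseStrategyByProbability()-style routine does (the actual member functions in CSimulator may differ).

    #include <random>
    #include <vector>

    // Sketch: sample a pure-strategy index with probability equal to its mixture weight
    // (pSolution[] from ColumnGen()).  This would be called once per run for
    // eChooseByProbability, or once per location for eChooseByProbabilityPerCell,
    // with the generator seeded from gStrategySeed.
    int sampleStrategy(const std::vector<double>& mixtureWeights, std::mt19937& rng)
    {
        std::discrete_distribution<int> dist(mixtureWeights.begin(), mixtureWeights.end());
        return dist(rng);   // index in [0, TotalSensorCount] of the chosen pure strategy
    }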
I needed access to the hyperplanes from the solver, not as stored to a file at the end of the program (after using the "save-all" pomdp-solve-5.3 command-line switch), which had been the status quo, but in memory during the life-cycle of the program, so I had to dig into the solver code. These hyperplanes are stored in next_alpha_list and prev_alpha_list in the solvePomdp() function that Tony wrote. I added a level of indirection that allowed me to use the pointer mechanism to output these lists from solvePomdp(). I only ever bothered to get the hyperplanes from the last stage and the penultimate stage, however. I needed the latter because the former points into the latter. Therefore I had to wait to destroy the penultimate-stage hyperplanes until I was done tracing the policy-graphs to back out the measurement versus classification costs. Policy-graphs are stored in a structure that holds all the data for one stage; see some of Tony's debug/output routines in pg.c for more info. In order to represent strategies, I created a 3D dynamically allocated array: "PG ***strategyTree/*[TotalSensorCount+1][NumVisibilityGroups][horizon]*/". An individual decision-tree could be represented as "PG *policyGraph/*[horizon]*/"; however, when working with visibility groups, different subsets of sensors will in general have distinct POMDP solutions with different decision-trees, hence the need for the [NumVisibilityGroups] dimension in this structure. The dimension of size [TotalSensorCount+1] is needed to represent one whole mixed strategy (as a mixture of TotalSensorCount+1 pure strategies). At points in the code I refer to one of the component pure strategies as a "PG **policy_graph_group/*[NumVisibilityGroups][horizon]*/". Hyperplanes are stored as a linked-list of "AlphaList" nodes, and the ids of the nodes correspond to the ids of the hyperplanes. The hyperplane coefficients themselves are stored in the member "alpha[]" of AlphaList. One other important point is that, for whatever reason, Tony left the root node in a list of hyperplanes as a dummy/header node (it stores summary statistics, not hyperplanes), and all the actual data follows after that in the linked list. He also has 2 separate ways of accessing immediate rewards (the pomdp_Q sparse matrix and the gImmRewardList linked-list).
In addition, he uses the global variable gCurAlphaVector as an array (basically a vector) that points into his linked-lists of hyperplanes. So he maintains hyperplanes both as a linked-list of nodes and, as the situation warrants, indexes them with a vector/array. After the call to solvePomdp() and after tracing the decision-trees, I store per-task information on classification costs and measurement costs in the global array TaskList[]. This methodology, with its global array variable, is suboptimal because it prevents multiple simulators from running in parallel; I had intended to modify this code to make it more object-oriented. This is work in progress, but at least it works as is. The routine that runs the Column Generation algorithm in the simulator is:

double ColumnGen(
    const SensorPayload *sensorPayloadList/*[NumVisibilityGroups]*/,
    PomdpSolveParams param,
    PG **policy_graph_group/*[NumVisibilityGroups][horizon]*/,
    double **lambdaOfStrategy/*[MAX_COLUMNS+1][TotalSensorCount]*/,
    const double *R/*[TotalSensorCount]*/,
    double *initLambda/*[TotalSensorCount]*/,
    PG ***strategyTree/*[TotalSensorCount+1][NumVisibilityGroups][horizon]*/,
    int *strategyToColumnMap/*[TotalSensorCount+1]*/,
    int *pColumnsInSolution,
    REAL *pSolution/*[TotalSensorCount+1]*/,
    double *J_classify_perStrategy/*[TotalSensorCount+1]*/,
    double **J_measure_perStrategy/*[TotalSensorCount+1][TotalSensorCount]*/)

The return value is the optimal total (measurement + classification) cost of the LP at the end of Column Generation.
Occasionally a degenerate mixed strategy is created (fewer pure strategies in the LP's basis than anticipated), in which case I still deem the result the "optimal" cost. However, when lp_solve-5.5 solves the LP for the Column Generation Master Problem, if it does not return a successful status flag (indicating that the optimal solution was found or that the LP was solved in the pre-solve step), then the program asserts false and terminates. The arguments to ColumnGen() have the following purposes:
• const SensorPayload *sensorPayloadList/*[NumVisibilityGroups]*/: describes the POMDP problem parameters for each subset of sensors
• PomdpSolveParams param: Tony's main structure, which I pass on to his solvePomdp() routine
• PG **policy_graph_group/*[NumVisibilityGroups][horizon]*/: temporary storage that I use repeatedly
• double **lambdaOfStrategy/*[MAX_COLUMNS+1][TotalSensorCount]*/: stores the trajectory of lambda values
• const double *R/*[TotalSensorCount]*/: the available resources for each sensor
• double *initLambda/*[TotalSensorCount]*/: the initial lambda values used to start Column Generation
• PG ***strategyTree/*[TotalSensorCount+1][NumVisibilityGroups][horizon]*/: pure strategies output from Column Generation
• int *strategyToColumnMap/*[TotalSensorCount+1]*/: mapping that describes the active columns (the columns with support) at the end of the Column Generation process
• int *pColumnsInSolution: value returned by pointer for the number of columns (out of a total of MAX_COLUMNS+1) that were generated in Column Generation
• REAL *pSolution/*[TotalSensorCount+1]*/: the mixture weights of each strategy with support
• double *J_classify_perStrategy/*[TotalSensorCount+1]*/: the classification cost (across all N locations) for each strategy
• double **J_measure_perStrategy/*[TotalSensorCount+1][TotalSensorCount]*/: the measurement (sensor resource) cost (across all N locations) for each strategy
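For reference, the restricted master problem that these quantities feed has the standard form for this kind of LP/POMDP decomposition; the following is a sketch written in terms of the variables above (the exact formulation used is the one developed in the main chapters). With K the number of columns generated so far:

\begin{align*}
\min_{w \ge 0} \quad & \sum_{k=0}^{K} w_k \, J^{\text{classify}}_k \\
\text{s.t.} \quad & \sum_{k=0}^{K} w_k \, J^{\text{measure}}_{k,i} \;\le\; R_i, \qquad i = 1,\dots,\text{TotalSensorCount}, \\
& \sum_{k=0}^{K} w_k \;=\; 1 .
\end{align*}

At termination the mixture weights of the columns with support are returned in pSolution[], and the duals of the resource constraints are the Lagrange multipliers (lambdas) used to price the next subproblem.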
For simplicity, in the Column Generation process storage is allocated for up to (MAX_COLUMNS+1) columns, and then only some of that memory is actually used, as reported by the value of (*pColumnsInSolution). The array strategyToColumnMap[] is a mapping from the pure strategies, indexed 0,...,TotalSensorCount, back to the range 0,...,MAX_COLUMNS, and specifies the (TotalSensorCount+1) elements out of (MAX_COLUMNS+1) that are part of the solution. For instance, lambdaOfStrategy[strategyToColumnMap[i]][j] would give the jth Lagrange multiplier of the ith strategy. In some situations it is possible for the ColumnGen() function to fail to find a non-trivial Column Generation solution (by failing to establish an initial basis of linearly independent columns for the LP), or to otherwise give back results where, e.g., just one pure strategy is used when there are 2 sensors (so one would expect 3 pure strategies in the solution). If fewer than TotalSensorCount+1 pure strategies have support, then one or more elements of strategyToColumnMap[] are set to -1. These circumstances typically arise at the end of a simulation when few resources remain and either no modes are feasible for a sensor given its resource constraints, or else the cost-versus-benefit of using a sensor mode weighs against the use of any resources in a particular situation. This can happen, for instance, if all of the cell information states (the probability vectors for each of the locations) are already very lopsided (low entropy) and there is not much uncertainty left in the states of any of the locations.
When some sensors effectively have no more resources, but other sensors are still useful, it would be ideal to eliminate the resource-less sensors from consideration in the ColumnGen() function and to reduce the size of the POMDPs that are solved by the solvePomdp() algorithm. Currently this is work in progress. The untoward result of not doing this pruning operation is that many more columns can be generated while running Column Generation, looking for precisely the right value of the Lagrange multiplier that will satisfy a nearly infeasible constraint. While this does not crash the program, it does waste some time. However, if presolving is used with the lp_solve-5.5 solver and constraints are eliminated from the Column Generation LP, it is important to make sure to access the solution (the Lagrange multipliers in the solution) in the right way. For example, the lp_solve-5.5 functions get_Nrows() and get_Ncolumns() return the number of rows and columns of the presolved model (after rows or columns are eliminated), whereas the functions get_Norig_rows() and get_Norig_columns() return the number of rows and columns in the LP before presolving. In order to obtain solution variables relative to the original enumeration of variables and constraints, the function get_var_primalresult() must be used. Currently my code only supports a static indexing of rows and columns and does not perform presolving operations that eliminate rows or columns; see the function CreateLP() in column_gen.c for the relevant section of code. (However, when I modified Tony's pomdp-solve-5.3 code to make use of lp_solve-5.5, I did enable presolving w.r.t. solving the LPs that prune hyperplanes. See the function LP_loadLpSolveLP() in his file lp-interface.c for more details.)
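As an illustration of the indexing issue, the following hedged sketch shows one way to read the solution back against the original (pre-presolve) numbering; it is written against the lp_solve-5.5 API as I understand it and is not a copy of CreateLP(). The index convention assumed here (0 = objective, then the original rows, then the original columns) should be checked against lp_lib.h before use.

    #include "lp_lib.h"   /* lp_solve-5.5 header */

    /* Sketch: read the solution using the original row/column numbering, which
       remains valid even when presolve removes rows or columns. */
    void readOriginalSolution(lprec *lp)
    {
        int nRowsOrig = get_Norig_rows(lp);
        int nColsOrig = get_Norig_columns(lp);

        for (int r = 1; r <= nRowsOrig; ++r) {
            REAL rowActivity = get_var_primalresult(lp, r);
            /* get_var_dualresult() is the corresponding call for the duals
               (the Lagrange multipliers of the resource constraints). */
            (void)rowActivity;
        }
        for (int c = 1; c <= nColsOrig; ++c) {
            REAL x = get_var_primalresult(lp, nRowsOrig + c);
            (void)x;   /* mixture weight of column c in the master LP */
        }
    }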
Eliminating sensors from consideration that are not useful and that slow down Column Generation is one of the two or three most important things that can be done to improve this program.
B.6 Simulator Design
Fig. B·1 shows the column_gen program's startup process. The function parseCmdLineAndCfgFile() is from pomdp-solve-5.3 and is responsible for generating the PomdpSolveParams structure that contains all of the parameters for the POMDP solver except for the matrices pomdp_P[], pomdp_R[] and pomdp_Q. After a CSimulator object is created, I call initPomdpSolve() and the .POMDP file parameters are read in from file. My simulation parameters are parsed in the function ReadSimulationData(), and the matrices pomdp_P[], pomdp_R[] and pomdp_Q are created at this time (from a set of intermediary/temporary dense matrices that were created during the execution of initPomdpSolve()), and the linked-list gImmRewardList is created. At this time I create a sensorPayload structure that represents a custom set of solution variables for every subset of sensor visibilities that is of interest.
The visibility groups of interest are determined after reading in the groups of prior probabilities from the .data file and determining how many distinct classes of visibility these priors are arranged into. As discussed in Ch. 3, prior probabilities for targets and sensor-target visibility are specified in the .data file (the last lines in the file), for example as:
• 10 01 0.10 0.20 0.60 0.10   % 10 targets w/ prior π1 can be seen by sensor 0
• 90 11 0.02 0.06 0.12 0.80   % 90 targets w/ prior π2 can jointly be seen by sensors 0+1
In this example π1 = [0.10 0.20 0.60 0.10]^T and π2 = [0.02 0.06 0.12 0.80]^T, and the position of the 0th sensor is on the right-hand side of the bitmask. The total number of targets that these target groups add up to needs to be consistent with the parameters given in the first line of the .data file (NumTargets = NumCellsX * NumCellsY). Once again, until there are motion planning and locality constraints, the physical dimensions of the grid (the layout of the locations) have no meaning.
The following is pseudo-code for the operation of ColumnGen():
• set all Lagrange multipliers to ∞
• call ComputeResourceUsage() to compute the classification cost of the do-nothing strategy
• for i = 1 to TotalSensorCount (i = 0 is the do-nothing strategy):
  – set the Lagrange multipliers to initialize the ith column
  – call ComputeResourceUsage() to compute the costs for strategy i
  – add the pure strategy from solvePomdp() to the strategy tree
  – store objective function coefficients and resource usage data in the variables for the LP tableau
• if initialization was not successful, break out of the function
• call CreateLP() using the variables stored for the LP tableau
• while Column Generation has not converged:
  – call SolveLP()
  – store the Lagrange multipliers from the solution of the LP
  – call ComputeResourceUsage() with the current Lagrange multiplier vector
  – add the pure strategy from solvePomdp() to the strategy tree
  – store objective function coefficients and resource usage data in the variables for the LP tableau
Figure B·1: Sequence diagram of the startup process in column_gen. (The diagram shows main() calling parseCmdLineAndCfgFile() to build the solver parameters, constructing a CSimulator, initPomdpSolve() parsing the .POMDP file and initializing the pomdp-solve-5.3 globals, and ReadSimulationData() parsing the .data file, instantiating sensor actions from the templates, building the per-configuration POMDP matrices and immediate-reward lists, creating the TaskList, and generating the grid of belief points.)
  – create and append a new column, using the variables stored for the LP tableau
• find the strategies in the final LP tableau with support
• call ComputeResourceUsage() for each final pure strategy with support and (this time) record per-task solution data
• output solution data to the console or to a file, as determined by the trace-flag settings
The function ComputeResourceUsage() has the following form:
• for i = 0 to TotalSensorCount-1:
  – for v = 0 to NumVisibilityGroups-1:
    ∗ if the vth sensorPayload contains the ith sensor, then update that sensor payload's Q and immRewardList structures with the sensor price given by the current lambda vector, and convert to the reward formulation
• update the master list (pomdp_Q and gImmRewardList) with the new sensor prices, and convert to the reward formulation
• for j = 0 to NumVisibilityGroups-1:
  – get a pointer to the jth sensorPayload structure
  – call initPomdpSolveSolution(), the per-solver-call code I specialized out of Tony's initPomdpSolve() function
  – call solvePomdp() using the POMDP model in the jth sensorPayload structure
  – call convertActions() to convert the results of solvePomdp(), which are relative to the action list stored in the jth sensorPayload structure, to the global, joint action list
  – trace hyperplane info or the strategy (policy-graph) (or not), according to the trace-flag settings
  – initialize the measurement-cost vector J_measure[] to 0.0
  – initialize the scalar variable J_classify to 0.0
  – for i = 0 to NumTargets-1:
    ∗ call computeExpectedCosts():
      · call findInitialHyperplane() to find the best hyperplane (action) for Subproblem i
      · call walkPolicyGraph() to break apart the classification and measurement costs for Subproblem i
      · accumulate the measurement cost of Subproblem i in the measurement-cost vector J_measure[]
    ∗ accumulate the classification cost of Subproblem i in the scalar variable J_classify
    ∗ store per-Subproblem solution info in the TaskList[] structure, converting from the reward formulation
  – call cleanUpPomdpSolveSolution() to do per-solver-call cleanup work
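The "sensor price" update above is the usual Lagrangian pricing step: before each subproblem solve, the immediate reward of every sensing action is charged for the resources it consumes at the current multipliers. As a sketch, writing r(x,u) for the immediate reward of action u in state x, ρ_i(u) for that action's usage of sensor i's resource, and λ_i for the current Lagrange multipliers (the exact sign conventions and the conversion between cost and reward formulations follow the main chapters):

\[
    \tilde r_\lambda(x,u) \;=\; r(x,u) \;-\; \sum_{i=1}^{\text{TotalSensorCount}} \lambda_i \, \rho_i(u),
\]

so that solvePomdp() maximizes a reward that already accounts for the priced resource usage of each action.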
Fig. B·2 shows how and when the C++ simulator makes use of the ColumnGen() algorithm for computing sensor plans. The general algorithm consists of establishing a count-down timer that re-plans when either (a) a number of tasks has been completed that is equal to some multiple of the number of targets (Subproblems), or (b) there are no more outstanding tasks to do. Fig. B·3 gives an overview of the C++ simulator's operation as it concerns RH Control and the execution of sensing tasks. The algorithm is fairly straightforward and has relatively low computational complexity (and programming complexity). The one issue worthy of discussion is how the simulator maintains an expected resource budget for each sensor. The code keeps track of the sensor resources available as well as the expected resource cost of all outstanding (scheduled) tasks. The simulator does not try to schedule any tasks beyond the point where: sensor resources available - (expected resource cost of all scheduled tasks + cost of new task) < 0. Instead, tasks are stored on two separate prioritized lists (queues): a list of tasks that are definitely scheduled to be executed, "m_taskList", and a list of tasks that are conditionally executed if resources are left over after executing all the tasks from m_taskList. The latter list of potentially executed tasks is called "m_delayedTaskList" (delayed => potential tasks). If resource utilization is less than expected, extra tasks from m_delayedTaskList are scheduled. Tasks are added onto these two lists according to an entropy-gain value that is used to sort/prioritize tasks. A new task with high priority can pop lower-value task(s) off of m_taskList and push them onto m_delayedTaskList. (However, currently I don't think that one high-value task can pop off two lower-value ones.) Tasks that cannot be scheduled within the expected resource constraints are simply ignored.
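A minimal sketch of this budgeting logic, in the spirit of a CVehicle::addTask()-style routine, is given below under the simplifying assumption of a single resource type; the real member function handles per-sensor resource vectors and the full demotion behaviour described above.

    #include <list>

    struct TaskSketch { double expectedCost; double priority; /* entropy gain */ };

    struct VehicleSketch {
        double sensorResources = 0.0;         // resources currently available
        double taskedResources = 0.0;         // expected cost of everything already scheduled
        std::list<TaskSketch> taskList;        // definitely scheduled, sorted by priority
        std::list<TaskSketch> delayedTaskList; // run only if resources are left over

        bool addTask(const TaskSketch& t) {
            if (taskedResources + t.expectedCost <= sensorResources) {
                insertByPriority(taskList, t);
                taskedResources += t.expectedCost;
                return true;
            }
            // Not enough expected budget: a higher-priority task may displace the
            // least valuable scheduled task; otherwise it waits on the delayed list.
            if (!taskList.empty() && taskList.back().priority < t.priority &&
                taskedResources - taskList.back().expectedCost + t.expectedCost
                    <= sensorResources) {
                TaskSketch dropped = taskList.back();
                taskList.pop_back();
                taskedResources -= dropped.expectedCost;
                insertByPriority(delayedTaskList, dropped);
                insertByPriority(taskList, t);
                taskedResources += t.expectedCost;
                return true;
            }
            insertByPriority(delayedTaskList, t);
            return false;
        }

        static void insertByPriority(std::list<TaskSketch>& q, const TaskSketch& t) {
            auto it = q.begin();
            while (it != q.end() && it->priority >= t.priority) ++it;
            q.insert(it, t);
        }
    };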
Figure B·2: Sequence diagram of how a sensor plan is constructed in column_gen. (The diagram shows CSimulator::update() detecting that it is time to re-plan, clearing the vehicles' task lists, calling computeOptimalPolicy(), choosing a strategy index via the configured strategy-choosing function, and then assigning CTask objects to vehicles, either from the mixed strategy or myopically, via addTask(), which enforces the expected resource budget and falls back to m_delayedTaskList when resources are insufficient.)
Figure B·3: Sequence diagram of the update cycle in column_gen. (The diagram shows CVehicle::update() executing the current CTask via useSensor(): checking the immediate resources required, generating a random observation, finding the next belief state for the cell using Bayes' Rule and Pr(y | ·, action), removing the immediate resources used, freeing the expected resource budget for branches of the decision-tree not taken, creating and queueing follow-up and delayed tasks, and popping the completed task; the simulation ends when no vehicle has resources remaining.)
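The belief update referred to in the figure (the "find next belief state using Bayes' Rule" step) is the standard POMDP information-state recursion. As a sketch, with b the current belief over target types for a cell, u the action just executed and y the observation drawn:

\[
    b'(x') \;=\; \frac{ R_u(x',y) \sum_{x} P_u(x,x')\, b(x) }
                     { \sum_{x''} R_u(x'',y) \sum_{x} P_u(x,x'')\, b(x) },
\]

where P_u and R_u are the transition and observation matrices for action u (pomdp_P[] and pomdp_R[]); the denominator is the Pr(y | b, u) used to weight the branches of the decision-tree.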
The simulator keeps track of several different statistics throughout the course of the batches of simulations that are executed upon calling RunSimulationConfiguration() in main.cpp. The average simulation cost, the variance of the simulation cost, the average number of unfinished tasks per simulation, the average number of interesting objects per simulation and the average number of unused resources at the end of each simulation are reported via the pointer-return mechanism. The quantity for the average number of unfinished tasks per simulation, "avgUnfinishedTasks", is defined as: avgUnfinishedTasks = sizeof(m_delayedTaskList at the end of the simulation) / numSimulations. This is basically the average number of tasks that were planned for in the Column Generation code but which were never actually executed, because of the discrete nature of task resource expenditures; e.g., it is not possible to do 1/100th of a mode2 action.
Fig. B·4 - Fig. B·7 provide the interfaces for the five main classes used in the C++ simulator: CSimulator, CVehicle, CGrid, CCell and CTask. The CSimulator class contains the heart of the simulation code. The CVehicle class is a container that represents a sensor-platform. The CGrid and CCell classes are fairly trivial.
To conclude this appendix, a brief word is in order concerning how tasks generate follow-up tasks. Rather than have each CVehicle object (representing 1+ sensors) deal with planning a whole decision-tree's worth of tasks in its queue of tasks, the method employed was to create just one task in the queue that can create follow-up tasks as required (depending on the actions specified by the associated decision-tree). Each task is associated with a pure strategy, and therefore all follow-up tasks are produced according to child-nodes of the decision-tree (pure strategy) that was used to generate the original task. When tasks are scheduled, they are budgeted not just according to their immediate (deterministic) sensing-resource costs, but according to the expected "down-stream" resource costs of the task and all of its potential child/follow-up tasks. In order to keep track of the down-stream resource costs, some book-keeping was actually done
on the nodes of the decision-tree (PG nodes) while walking the decision-trees for each pure strategy. The variable J_measure_downstream[] keeps track of this information. The objective was not just to keep track of the immediate deterministic resource costs (which is trivial), nor of all of the resource costs until the end of the horizon (which is what the walkPolicyGraph() function computes), but to keep track of expected resource expenditures over intermediate horizons as well. This allows the simulator to follow a plan that, in expectation, will execute e.g. two sensing tasks per location even when the Column Generation sensing plan was computed with a horizon of 4 sensing actions per location.
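This bookkeeping amounts to the following recursion over the policy-graph nodes; it is a sketch in the notation used above, with ρ(u_n) the immediate resource cost of the action at node n and Pr(y | n) the branch probabilities used while walking the tree:

\[
    J^{\text{measure}}_{\text{downstream}}(n) \;=\; \rho(u_n) \;+\; \sum_{y} \Pr(y \mid n)\; J^{\text{measure}}_{\text{downstream}}\big(\mathrm{child}(n,y)\big),
\]

with the value taken to be zero at leaf/declaration nodes; truncating the recursion at an intermediate depth gives the expected resource expenditure over that shorter horizon.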
Figure B·4: Interface for the CSimulator class (member variables and methods).
Figure B·5: Interface for the CVehicle class (member variables and methods).
Figure B·6: Interfaces for the CGrid and CCell classes (member variables and methods).
Figure B·7: Interface for the CTask class (member variables and methods).