Efficient Model-based 3D Tracking by Using Direct Image Registration
UNIVERSIDAD POLITÉCNICA DE MADRID
FACULTAD DE INFORMÁTICA
TESIS DOCTORAL
Efficient Model-based 3D Tracking by Using Direct Image Registration
presentada en la
FACULTAD DE INFORMÁTICA
de la
UNIVERSIDAD POLITÉCNICA DE MADRID
para la obtención del
GRADO DE DOCTOR EN INFORMÁTICA
AUTOR: Enrique Muñoz Corral
DIRECTOR: Luis Baumela Molina
Madrid, 2012
Agradecimientos
La verdad es que los diez años (¡diez!) que he tardado en escribir esta tesis dan para muchas cosas, y si tuviera que agradecer algo a todas las personas que me han ayudado, necesitaría un capítulo entero. En primer lugar quisiera agradecer a Luis Baumela, gran director de tesis y mejor persona, el haber despertado en mí el gusanillo por la investigación, y sobre todo, por tener la suficiente paciencia para aguantar mis cabezonadas. Luis, si no fuera por ti, no habría entrado en la Universidad y estaría en la empresa privada ganando una pasta gansa—yeah, thank you so much!
Gracias mil a Javier de Lope, por incansables discusiones técnicas y no tan técnicas, y sobre todo a José Miguel Buenaposada, quien durante todos estos años me ha aguantado, ayudado, irritado, bromeado, e incluso buscado trabajo. No me puedo olvidar de los buenos ratos pasados a la hora de la comida junto con las "chicas" de estadística (Maribel, Arminda, Concha y Juan Antonio), en las que han aguantado mis interminables peroratas sobre la burbuja inmobiliaria y los políticos patrios. Un recuerdo también para todos los compañeros que han pasado por el laboratorio L-3202 durante estos años: los "chicos de Javi" (Javi, Juan, Bea y Yadira), Juan Bekios, los dos "Pablos" (Márquez y Herrero), Antonio y Rubén.
Quisiera agradecer también a Lourdes Agapito por permitirme participar en el proyecto Automated facial expression analysis using computer vision, financiado por la Royal Society del Reino Unido. Gracias a este proyecto pude tener el privilegio de trabajar con Lourdes y con Xavier Lladó, y sobre todo de conocer a ese singular personaje llamado Alessio del Bue. No tengo palabras para agradecer a Alessio el ser tan majete y el aguantar estoicamente tantas veces como le hemos gorroneado. Tampoco puedo olvidarme de la ayuda prestada por el profesor Thomas Vetter y su grupo de la Universidad de Basilea (especialmente Brian Amberg y Pascal Paysan); ellos se tomaron la molestia de construir un modelo tridimensional de mi cara, incluyendo deformaciones y expresiones. No quisiera cerrar estos agradecimientos sin comentar que parte de los trabajos de esta tesis se han realizado bajo los proyectos del Ministerio de Ciencia y Tecnología TIC2002-00591, y del Ministerio de Ciencia e Innovación TIN2008-06815-C02-02.
Y por último, aunque no por ello menos importante, agradecer a Susana la paciencia que ha tenido todos estos años (que han sido muchos) en los que he estado liado con la tesis. ¡Va por ti, Susana!
Enero de 2012
9.1 Classification of Motion Warps
D.1 Lemmas used to re-arrange matrix products
D.2 Lemmas used to re-arrange Kronecker matrix products
List of Algorithms
1 Outline of the basic GN-based descent method for image registration
2 Outline of the Lucas-Kanade algorithm
3 Outline of the Hager-Belhumeur algorithm
4 Outline of the Forward Compositional algorithm
5 Outline of the Inverse Compositional algorithm
6 Iterative factorization of the Jacobian matrix
7 Outline of the HB3DTM algorithm
8 Outline of the full-factorized HB3DMM algorithm
9 Outline of the HB3DMMSF algorithm
10 Outline of the Efficient Forward Compositional algorithm
11 Outline of the Generalized Inverse Compositional algorithm
12 Creating the synthetic datasets
13 Outline of the GN algorithm
Resumen
Esta tesis trata el problema del seguimiento eficiente de objetos 3D en secuencias de imágenes. Tratamos el problema del seguimiento 3D usando registrado de imágenes directo, una técnica que permite alinear dos imágenes usando sus niveles de intensidad. El registrado de imágenes se suele resolver usando métodos de optimización iterativa, donde la función a minimizar depende del error en los niveles de intensidad. En esta tesis examinaremos los métodos de registrado de imágenes más comunes, haciendo hincapié en aquellos que usan algoritmos eficientes de optimización.
En esta tesis investigaremos dos formas de registrado eficiente. La primera incluye a los métodos aditivos de registrado: los parámetros de movimiento se calculan incrementalmente mediante una aproximación lineal de la función de error. Dentro de este tipo de algoritmos, nos centraremos en el método de factorización de Hager y Belhumeur. Introduciremos un requisito necesario que el algoritmo de factorización debe cumplir para tener una buena convergencia. Además, proponemos un procedimiento automático de factorización que nos permitirá seguir objetos 3D tanto rígidos como deformables.
El segundo tipo son los llamados métodos composicionales de registrado, donde la norma de error se reescribe usando composición de funciones. Estudiaremos los métodos composicionales más usuales, haciendo hincapié en el método de registrado más rápido, el algoritmo composicional inverso. Introduciremos un nuevo método de registrado composicional, el algoritmo Efficient Forward Compositional, que nos permite interpretar los mecanismos de funcionamiento del algoritmo composicional inverso. Gracias a esta interpretación novedosa, enunciaremos dos requisitos fundamentales para algoritmos composicionales eficientes.
Por último, realizaremos una serie de experimentos con datos reales y sintéticos para comprobar los postulados teóricos. Además, diferenciaremos entre los problemas de registrado y seguimiento para algoritmos eficientes: aquellos algoritmos que cumplan su(s) requisito(s) podrán usarse para registrado de imágenes, pero no para seguimiento.
Abstract
This thesis deals with the problem of efficiently tracking 3D objects in sequences of
images. We tackle the efficient 3D tracking problem by using direct image registra-
tion. This problem is posed as an iterative optimization procedure that minimizes
a brightness error norm. We review the most popular iterative methods for image
registration in the literature, turning our attention to those algorithms that use
efficient optimization techniques.
Two forms of efficient registration algorithms are investigated. The first type
comprises the additive registration algorithms: these algorithms incrementally com-
pute the motion parameters by linearly approximating the brightness error function.
We centre our attention on Hager and Belhumeur’s factorization-based algorithm for
image registration. We propose a fundamental requirement that factorization-based
algorithms must satisfy to guarantee good convergence, and introduce a systematic
procedure that automatically computes the factorization. Finally, we also introduce two warp functions, for registering rigid and nonrigid 3D targets, that satisfy the requirement.
The second type comprises the compositional registration algorithms, where the brightness error function is written using function composition. We study the
current approaches to compositional image alignment, and we emphasize the impor-
tance of the Inverse Compositional method, which is known to be the most efficient
image registration algorithm. We introduce a new algorithm, the Efficient Forward
Compositional image registration: this algorithm avoids the necessity of inverting
the warping function, and provides a new interpretation of the working mechanisms
of the inverse compositional alignment. By using this information, we propose two
fundamental requirements that guarantee the convergence of compositional image
registration methods.
Finally, we support our claims by using extensive experimental testing with
synthetic and real-world data. We propose a distinction between image registration and tracking when using efficient algorithms. We show that, depending on whether the fundamental requirements hold, some efficient algorithms are suitable for image registration but not for tracking.
Notations
Specific Sets and Constants
X Set of target points or target region.
Ω Set of target points currently visible.
N Number of points in the target region—i.e., N = |X|.
NΩ Number of visible target points—i.e., NΩ = |Ω|.
P Dimension of the parameter space.
C Number of image channels.
K Dimension of the deformation space.
F Number of frames in the image sequence.
Vectors and Matrices
a Lowercase bold letters denote vectors.
Am×n Monospace uppercase letters denote m × n matrices.
vec(A) Vectorization of matrix A: if A is an m × n matrix, vec(A) is an mn × 1 vector.
Ik ∈ Mk×k k × k identity matrix.
I 3 × 3 identity matrix.
0k ∈ R^k k × 1 vector of zeroes.
0m×n ∈ Mm×n m × n matrix of zeroes.
Camera Model Notations
x ∈ R^2 Pixel location in the image.
x̂ ∈ P^2 Location in the projective space.
X ∈ R^3 Point in Cartesian coordinates.
Xc ∈ R^3 Point expressed in the camera reference system.
K ∈ M3×3 3 × 3 camera intrinsics matrix.
P ∈ M3×4 3 × 4 camera projection matrix.
Imaging Notations
T(x) ∈ R^C Brightness value of the template image at pixel x.
I(x, t) ∈ R^C Brightness value of the current image at pixel x at instant t.
It(x) Another notation for I(x, t).
T, It Vector forms of the functions T and It.
I[x] Composite function I ◦ p, that is, I[x] = I(p(x)).
Optimization Notations
µ ∈ R^P Column vector of motion parameters.
µ0 ∈ R^P Initial guess of the optimization.
µi ∈ R^P Parameters at the i-th iteration of the optimization.
µ* ∈ R^P Actual optimum of the optimization.
µt ∈ R^P Parameters at image t.
µJ ∈ R^P Parameters at which the Jacobian is computed for efficient algorithms.
δµ ∈ R^P Incremental step at the current state of the optimization.
ℓ(δµ) Linear model for the incremental step δµ.
L(δµ) Local minimizer for the incremental step δµ.
r(µ) ∈ R^N N × 1 vector-valued residual function at parameters µ.
∇x̂f(x) Derivatives of function f with respect to the variables x, instantiated at x̂.
J(µ) ∈ MN×P Jacobian matrix of the brightness dissimilarity at µ (i.e., J(µ) = ∇µ̂D(X; µ)).
H(µ) ∈ MP×P Hessian matrix of the brightness dissimilarity at µ (i.e., H(µ) = ∇²µ̂D(X; µ)).
Warp Function Notations
f(x; µ): R^n × R^P → R^n Motion model or warp.
p: R^n → R^2 Projection onto the Cartesian plane.
R ∈ M3×3 3 × 3 rotation matrix.
ri ∈ R^3 Columns of the rotation matrix R (i.e., R = (r1, r2, r3)).
t ∈ R^3 Translation vector in Euclidean space.
D: R^2 × R^P → R Dissimilarity function.
U: R^P × R^P → R^P Parameter update function.
ψ: R^P × R^P → R^P Jacobian update function for algorithm GIC.
Factorization Notations
⊗ Kronecker product.
⊙ Row-wise Kronecker product.
S(x) Constant matrix in the factorization method, computed from the target structure and the camera calibration.
M(µ) Variable matrix in the factorization method, computed from the motion parameters.
W ∈ Mp×p Weighting matrix for Weighted Least-Squares.
π: R^n → R^n Permutation of the set {1, . . . , n}.
Pπ(n) ∈ Mn×n Permutation matrix of the set {1, . . . , n}.
π(n, q) Permutation of the set {1, . . . , n} with ratio q.
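The vec(·) operator and the Kronecker products listed here are the algebraic core of the factorization notation: the identity vec(AXB) = (Bᵀ ⊗ A)vec(X) is what lets a product with a variable middle factor be rewritten as a constant matrix times a vector, i.e., split into an S(x)-like and an M(µ)-like part. A minimal numerical check of both products (NumPy sketch; the matrix sizes are arbitrary illustrations, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

def vec(M):
    """Stack the columns of M into a single vector (column-major)."""
    return M.reshape(-1, order="F")

# Identity behind the factorization: a product with a variable middle
# factor becomes a constant matrix times a vector,
#   vec(A X B) = (B^T kron A) vec(X).
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)

# Row-wise Kronecker product: row i of the result is kron(A[i], C[i]).
C = rng.standard_normal((3, 4))
row_kron = np.einsum("ij,ik->ijk", A, C).reshape(A.shape[0], -1)
assert np.allclose(row_kron[0], np.kron(A[0], C[0]))
```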
3D Model Notations
F ⊂ R^2 Reference frame for algorithm HB.
S: F → R^3 Target shape function.
T: F → R^C Target texture function.
u ∈ F Target coordinates in the reference frame.
S ∈ M3×Nv Target 3D shape.
s ∈ R^3 Shape coordinates in Euclidean space.
s0 ∈ R^3 Mean shape of the target generative model.
si ∈ R^3 i-th basis of deformation of the target generative model.
n ∈ R^3 Normal vector to a given triangle; n is normalized with the triangle depth (i.e., if x belongs to the triangle, then n⊤x = 1).
Bs ∈ M3×K Basis of deformations.
c ∈ R^K Vector containing the K deformation coefficients.
HA ∈ M3×3 Affine warp between the image reference frame and F.
Ṙ∆ Derivatives of the rotation matrix R with respect to the Euler angle ∆ ∈ {α, β, γ}.
λ ∈ R Homogeneous scale factor.
v ∈ R^3 Change of variables defined as v = K⁻¹HAû.
Function Naming Conventions
fH82D: P^2 → P^2 8-dof homography.
fH6P: P^2 → P^2 Plane-induced homography.
fH6S: P^2 → P^2 Shape-induced homography.
f3DTM: P^2 → P^2 3D Textured Model motion model.
fH6D: P^2 → P^2 Deformable shape-induced homography.
f3DMM: P^2 → P^2 3D Textured Morphable Model motion model.
ε: R^P → R Reprojection error function.
Algorithm Naming Conventions
LK Lucas-Kanade algorithm [Lucas and Kanade, 1981].¹
HB Hager-Belhumeur factorization algorithm [Hager and Belhumeur, 1998].
IC Inverse Compositional algorithm [Baker and Matthews, 2004].
FC Forward Compositional algorithm [Baker and Matthews, 2004].
GIC Generalized Inverse Compositional algorithm [Brooks and Arbel, 2010].
EFC Efficient Forward Compositional algorithm.
LKH8 Lucas-Kanade algorithm for homographies.
LKH6 Lucas-Kanade algorithm for plane-induced homographies.
LK3DTM Lucas-Kanade algorithm for 3D Textured Models (rigid).
LK3DMM Lucas-Kanade algorithm for 3D Morphable Models (deformable).
HB3DTR Full-factorized HB algorithm for 6-dof motion in R^3 [Sepp, 2006].
HB3DTM Full-factorized HB algorithm for 3D Textured Models (rigid).
HB3DMM Full-factorized HB algorithm for 3D Morphable Models (deformable).
HB3DMMSF Semi-factorized HB algorithm for 3D Morphable Models.
HB3DMMNF HB algorithm for 3D Morphable Models without the factorization stage.
ICH8 IC algorithm for homographies.
ICH6 IC algorithm for plane-induced homographies.
GICH8 GIC algorithm for homographies.
GICH6 GIC algorithm for plane-induced homographies.
IC3DRT IC algorithm for 6-dof motion in R^3 [Muñoz et al., 2005].
FCH6PP FC algorithm for plane+parallax homographies.
¹We only show the most relevant citation for each algorithm.
Chapter 1
Introduction
This thesis deals with the problems of registration and tracking in sequences of
images. Both problems are classical topics in Computer Vision and Image Processing
that have been widely studied in the past. We summarize the subjects of this thesis
in the dissertation title:
Efficient Model-based 3D Tracking by using Direct Image Registration
What is 3D Tracking? Let the target be a part of the scene—e.g. the cube in
Figure 1.1. We define tracking as the process of repeatedly computing the target
state in a sequence of images. When we describe this state as the relative 3D orientation and location of the target with respect to the coordinate system of the camera (or another arbitrary reference system), we refer to this process as 3D rigid tracking (see Figure 1.1). If we also include state parameters that describe the possible deformation of the object, we have 3D nonrigid or deformable tracking (see Figure 1.2). We use 3D tracking to refer to both the rigid and the nonrigid cases.
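The state just described can be made concrete as a small container: rigid tracking estimates pose only, while deformable tracking appends deformation coefficients. A hypothetical sketch (this parameterization is illustrative, not the thesis's actual one):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TargetState:
    """Per-frame state for 3D tracking: pose with respect to the camera,
    plus optional deformation coefficients for the nonrigid case."""
    rotation: List[float]     # e.g. three Euler angles (alpha, beta, gamma)
    translation: List[float]  # 3D translation (tx, ty, tz)
    deformation: List[float] = field(default_factory=list)  # empty => rigid

rigid = TargetState([0.0, 0.0, 0.0], [0.0, 0.0, 1.0])
nonrigid = TargetState([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.1, -0.2])
```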
What is Direct Image Registration? When the target is imaged by two cameras with different points of view, the resulting images are different although they represent the same portion of the scene (see Figure 1.3). Image Registration or Image Alignment computes the geometric transformation that best aligns the coordinate systems of both images such that their pixel-wise differences are minimal (cf. Figure 1.3). We say that image registration is a direct method when we register the coordinate systems by using only the brightness differences of the images.
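As an illustration of what "direct" means in practice, the following sketch registers two synthetic images under a pure-translation warp with a single Gauss-Newton (Lucas-Kanade-style) step on the brightness error; all names and sizes are illustrative, and a real system would iterate this step, typically inside a coarse-to-fine scheme:

```python
import numpy as np

# Toy direct registration: estimate a pure 2D translation between two
# images with one Gauss-Newton step on the sum of squared brightness
# differences.

def gn_translation_step(T, I):
    """Solve the linearized brightness constancy I(x + d) ~ T(x):
    the Jacobian stacks the image gradients, the residual is T - I."""
    gy, gx = np.gradient(I)                         # brightness gradients
    J = np.stack([gx.ravel(), gy.ravel()], axis=1)  # N x 2 Jacobian
    r = (T - I).ravel()                             # N x 1 residual
    d, *_ = np.linalg.lstsq(J, r, rcond=None)       # step: argmin ||J d - r||
    return d                                        # estimated (dx, dy)

# Smooth synthetic template and a copy shifted by one pixel along x.
y, x = np.mgrid[0:64, 0:64].astype(float)
T = np.exp(-((x - 32.0) ** 2 + (y - 32.0) ** 2) / 200.0)
I = np.exp(-((x - 33.0) ** 2 + (y - 32.0) ** 2) / 200.0)

dx, dy = gn_translation_step(T, I)  # dx should be close to 1, dy close to 0
```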
What is Model-based? We say that a technique is model-based when we re-
strict the information from the real world by using certain assumptions: on the
target dynamics, on the target structure, on the camera sensing process, etc—e.g.
in Figure 1.1 we model the target with a cube structure and rigid body dynamics.
Figure 1.1: Example of 3D rigid tracking. (Left) Selected frames of a scene containing
a textured cube. We track the object and we overlay its state in blue. (Right) The relative
position of the camera—represented by a coloured pyramid—and the cube is computed
from the estimated 3D parameters.
Figure 1.2: 3D Nonrigid Tracking. Selected frames from a sequence of a cushion
under a bending motion. We track some landmarks on the cushion through the sequence,
and we plot the resulting triangular mesh for the selected frames. The motion of the
landmarks is both global—translation of the mesh—and local—changes on the relative
position of the mesh vertices due to the deformation. Source: Alessio del Bue.
And Finally, What does Efficient mean? We say that a method is efficient
if it substantially improves the computation time with respect to gold-standard
techniques. In a more practical way, efficient is equivalent to real-time—i.e. the
tracking procedure operates at 25 frames per second.
Figure 1.3: Image registration. (Top row) Images of a portion of the scene from two distinct points of view. We have outlined the target in blue (top left) and green (top right). (Bottom) The left image is warped such that the coordinates of the target match up in both images. Source: Graffiti sequence, from Oxford Visual Geometry Group.
1.1 Motivation
In less than thirty years, video tracking has gone from being confined to academic and military environments to achieving widespread recognition, mainly thanks to the media.
Thus, video tracking is now a staple in sci-fi shows and films, where futuristic Head-up Displays (HUD) work in a show-and-tell fashion, a camera surveillance system can locate an object or a person, or a robot can address people and even recognize their mood.
However, TV is, sad to say, years ahead of reality. Actual video tracking systems are still in a primitive stage: they are inaccurate, sloppy, slow, and usually work only in laboratory conditions. Nevertheless, video tracking is progressing by leaps and bounds, and it will probably match some sci-fi standards soon.
We investigate the problem of efficiently tracking an object in a video sequence. Nowadays there exist several efficient optimization algorithms for video tracking or image registration. We study two of the fastest algorithms available: the Hager-Belhumeur factorization algorithm and the Baker-Matthews inverse compositional algorithm. Both algorithms, although very efficient for planar registration, present diverse problems for 3D tracking. This thesis studies which assumptions can be made with these algorithms whilst underlining their limitations through extensive testing. Ultimately, the objective is to provide a detailed description of each algorithm, pointing out pros and cons, leading to a kind of Quick Guide to Efficient Tracking Algorithms.
1.2 Applications
Typical applications of 3D tracking include target localization for military operations; security and surveillance tasks such as person counting, face identification, people detection, determining people's activity, or detecting abandoned objects; it also includes human-computer interaction for computer security, aids for disabled people, or even controlling video games. Tracking is also used for augmenting video sequences with additional information such as advertisements, expanding information about the scene, or adding or removing objects from the scene. We show some examples of actual industrial applications in Figure 1.4.
A tracking process that is widely used in the film industry is Motion Capture: we track the motion of the different parts of an actor's body using a suit equipped with reflective markers; then, we transfer the estimated motion to a computer-generated character (see Figure 1.5). Using this technique, we can animate a synthetic 3D character in a movie, such as Gollum in The Lord of the Rings trilogy (2001), or Jar-Jar Binks in the new Star Wars trilogy (1999). Other relevant movies that employ these techniques are Polar Express (2004), King Kong (2005), Beowulf (2007), A Christmas Carol (2009), and Avatar (2009). Furthermore, we can generate a complete computer-generated movie populated with characters animated through motion capture. Facial motion capture is of special interest to us: we animate a computer-generated facial expression by facial expression tracking (see Figure 1.5).
We turn our attention to markerless facial motion capture, that is, the process of recovering the face expression and orientation without using fiducial markers. Markerless motion capture does not require special equipment—such as close-up
cameras—or a complicated set-up on the actor's face—such as special reflective make-up or facial stickers. In this thesis we propose a technique that captures facial expression motion by using only brightness information and prior knowledge of the deformation of the target (see Figure 1.6).
Figure 1.4: Industrial applications of 3D tracking. (Top left) Augmented reality inserts virtual objects into the scene. (Top middle) Augmented reality shows additional information about tracked objects in the scene. Source: Hawk-Eye, Hawk-Eye Innovations Ltd., copyright © 2008. (Top right) Tracking a pedestrian for video surveillance. Source: Martin Communications, copyright © 1998-2007. (Bottom left) People-flow counting by tracking. Source: EasyCount, by Keeneo, copyright © 2010. (Bottom middle) Car tracking detects possible traffic infractions or estimates car speed. Source: Fibridge, copyright ©. (Bottom right) Body tracking is used for interactive control of video games. Source: Kinect, Microsoft, copyright © 2010.
1.3 Contributions of the Thesis
We outline the remaining chapters of the thesis and their principal contributions as
follows:
Chapter 2: Literature Review We provide a detailed survey of the literature on techniques for both image registration and tracking.
Chapter 3: Efficient Image Registration We review the state of the art in efficient methods. We introduce a taxonomy for efficient registration algorithms:
an algorithm is classified as either additive or compositional.
Figure 1.5: Motion capture in the film industry. Facial and body motion capture from Avatar™ (top row) and Polar Express™ (bottom row). (Left column) The body motion and head pose are computed using reflective fiducial markers—the grey spheres of the motion capture jumpsuit. For facial expression capture, plenty of smaller markers and even close-up cameras are used. (Right column) The estimated motion is used to animate the characters in the movie. Source: Avatar, 20th Century Fox, copyright © 2009; Polar Express, Warner Bros. Pictures, copyright © 2004.
Chapter 4: Equivalence of Gradients We introduce the gradient equivalence equation constraint: we show that satisfying this assumption has positive effects on the performance of the algorithms.
Chapter 5: Additive Algorithms We review the constraints that determine the convergence of additive registration algorithms, especially the factorization approach. We provide a methodical procedure to factorize an algorithm in general form; we state a basic set of theorems and lemmas that enable us to systematize the factorization. We introduce two tracking algorithms that use factorization: one for rigid 3D objects, and another for deformable 3D objects.
Figure 1.6: Markerless facial motion capture. (Top) Several frames where the face modifies both its orientation—due to a rotation—and its shape structure—due to changes in facial expression. (Bottom) The tracking state vector includes both pose and deformation. Legend: (blue) actual projection of the target shape using the estimated parameters; (pink) highlighted projections corresponding to the profiles of the jaw, eyebrows, lips and nasolabial wrinkles.
Chapter 6: Compositional Algorithms We review the basic inverse compositional algorithm. We introduce an alternative efficient compositional algorithm that is equivalent to the inverse compositional algorithm under certain assumptions. We show that if the gradient equivalence equation holds, then both efficient compositional methods converge.
Chapter 7: Computational Complexity We study the resources used by the
registration algorithms in terms of their computational complexity. We compare the
theoretical complexities of efficient and nonefficient algorithms.
Chapter 8: Experiments We devise a set of experimental tests to confirm our assumptions on the registration algorithms, that is, to (1) verify the dependence of convergence on the algorithm constraints, and (2) evaluate the theoretical complexities with actual data.
Chapter 9: Conclusions and Future Work Finally, we draw conclusions about where each technique is most suitably used, and we provide insight into future work to improve the proposed methods.
Chapter 2
Literature Review
In this chapter we review the basic literature on tracking and image registration.
First we introduce the basic similarities and differences between image registration
and tracking. Then, we review the usual methods for both tracking and image
registration.
2.1 Image Registration vs. Tracking
The frontier between image registration and tracking is a bit fuzzy: tracking identi-
fies the location of an object in a sequence of images, whereas registration finds the
pixel-to-pixel correspondence between a pair of images. Note that in both cases we
compute a geometric and photometric transformation between images: pairwise in
the context of image registration and among multiple images for the tracking case.
Although we may indistinctly use the terms registration and tracking, we define the
following subtle semantic differences between them:
• Image registration finds the best alignment between two images of the same scene. We use a geometric transformation to align the images of both cameras. We consider that image registration emphasizes finding the best alignment between two images in visual terms, not accurately recovering the parameters of the transformation—this is usually the case in, e.g., medical applications.
• Tracking finds the location of a target object in each frame of a sequence. We
assume that the difference of object position between two consecutive frames is
small. In tracking we are typically interested in recovering the parameters describing the state of the object rather than the coordinates of its location: we can describe an object using richer information than just its position (e.g. 3D orientation, modes of deformation, lighting changes, etc.). This is usually the
case in robotics [Benhimane and Malis, 2007; Cobzas et al., 2009; Nick Molton,
2004], or augmented reality [Pilet et al., 2008; Simon et al., 2000; Zhu et al.,
2006].
Also, image registration involves two images with an arbitrary baseline whereas tracking usually operates in a sequence with a small inter-frame baseline. We assume
that tracking is a higher level problem than image registration. Furthermore, we
propose a tracking-by-registration approach: we track an object through a sequence
by iteratively registering pairs of consecutive images [Baker and Matthews, 2004];
however, we can perform tracking without any registration at all (e.g. tracking-
by-detection [Viola and Jones, 2004], or tracking-by-classification [Vacchetti et al.,
2004]).
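The tracking-by-registration approach reduces to a simple loop in which the solution for one frame warm-starts the registration of the next, which is what makes the small inter-frame baseline assumption pay off. A schematic sketch (the `register` routine stands in for any iterative direct registration algorithm, e.g. Lucas-Kanade, Hager-Belhumeur, or inverse compositional; its name and signature are hypothetical):

```python
# Sketch of tracking-by-registration: the parameters estimated for frame
# t-1 seed the registration of frame t.

def track(template, frames, mu0, register):
    """Return the list of per-frame motion parameter estimates."""
    trajectory = []
    mu = mu0                  # initial guess for the first frame
    for frame in frames:
        # Register the template against the current frame, starting
        # from the previous frame's solution.
        mu = register(template, frame, mu)
        trajectory.append(mu)
    return trajectory
```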
2.2 Image Registration
Image registration is a classic topic in computer vision, and numerous approaches have been proposed in the literature; two good surveys on the subject are [Brown, 1992] and [Zitova, 2003]. The process involves computing the pixel-to-pixel correspondence between the two images: that is, for each pixel in one image we find the corresponding pixel in the other image so that both pixels project from the same actual point in the scene (cf. Figure 1.3). Applications include image mosaicing [Capel, 2004; Irani and Anandan, 1999; Shum and Szeliski, 2000], video stitching [Caspi and Irani, 2002], super-resolution [Capel, 2004; Irani and Peleg, 1991], region tracking [Baker and Matthews, 2004; Hager and Belhumeur, 1998; Lucas and Kanade, 1981], recovering scene/camera motion [Bartoli et al., 2003; Irani et al., 2002], and medical image analysis [Lester and Arridge, 1999].
Image registration methods commonly fall into one of the two following groups [Bartoli, 2008; Capel, 2004; Irani and Anandan, 1999]:
Direct methods A direct image registration method aligns two images by using only the colour—or intensity, in greyscale data—values of the pixels that are common to both images (namely, the region of support). Direct methods minimize an error measure based on the image brightness over the region of support. Typical error measures include an L2-norm of the brightness difference [Irani and Anandan, 1999; Lucas and Kanade, 1981], normalized cross-correlation [Brooks and Arbel, 2010; Lewis, 1995], or mutual information [Dowson and Bowden, 2008; Viola and Wells, 1997].
Feature-based methods In feature-based methods, we align two images by com-
puting the geometric transformation between a set of salient features that
we detect in each image. The idea is to abstract distinct geometric image
features that are more reliable than the raw intensity values; typically these
features show invariance with respect to modifications of the camera point-of-
view, illumination conditions, scale, or orientation of the scene [Schmid et al.,
2000]. Corners or interest points [Bay et al., 2008; Harris and Stephens, 1988;
Lowe, 2004; Torr and Zisserman, 1999] are classical features in the literature,
although we can use other features such as edges [Bartoli et al., 2003], or
extremal image regions [Matas et al., 2002].
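The two brightness error measures most often cited for direct methods can be written in a few lines. A minimal NumPy sketch (the function names are ours; both patches are assumed to be equal-sized greyscale arrays):

```python
import numpy as np

def ssd(patch_a, patch_b):
    """Sum of Squared Differences: the L2-norm of the brightness difference."""
    d = patch_a.astype(float) - patch_b.astype(float)
    return float(np.sum(d * d))

def ncc(patch_a, patch_b):
    """Normalized cross-correlation: invariant to affine brightness changes."""
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Note the opposite conventions: SSD is minimized (0 means a perfect match), whereas NCC is maximized (1 means a perfect match up to gain and offset).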
Direct or feature-based methods? Choosing between direct and feature-based
methods is not an easy task: we have to know the strong points of each method
and for which applications it is more suitable. A good comparison between the two
types of methods is [Capel, 2004]. Feature-based methods typically show strong
invariance to a wide range of photometric and geometric transformations of the im-
age, and they are more robust to partial occlusions of the scene than their direct
counterparts [Capel, 2004; Torr and Zisserman, 1999]. On the other hand, direct
methods can align images with sub-pixel accuracy, estimate dominant motion even
when multiple motions are present, and they can provide a dense motion field in the
case of 3D estimation [Irani and Anandan, 1999]. Moreover, direct methods do not
require high-frequency textured surfaces (corners) to operate, but perform best on
smooth grey-level transitions [Benhimane et al., 2007].
2.3 Model-based 3D Tracking
In this section we define model-based tracking, and we review the previous
literature on 3D tracking of rigid and nonrigid objects. A special case of interest
for nonrigid objects is the 3D tracking of human faces, or facial motion capture.
We can recover the 3D orientation and position of the target with respect to the
camera (or an arbitrary reference system), or, conversely, the relative displacement
and orientation of the camera with respect to the target (or another arbitrary
reference system in the scene) [Sepp, 2008]. A good survey on the subject is [Lepetit
and Fua, 2005].
2.3.1 Modelling assumptions
In model-based techniques we use a priori knowledge about the scene, the target,
or the sensing device, as a basis for the tracking procedure. We classify these
assumptions on the real-world information as follows:
Target model
The target model specifies how to represent the information about the structure of
the scene in our algorithms. Template tracking or template matching simply repre-
sents the target as the pixel intensity values inside a region defined on one image:
we call this region—or the image itself—the reference image or template. One of
the first proposed techniques for template matching was [Lucas and Kanade, 1981],
although it was initially devised for solving optical flow problems. The literature
proposes numerous extensions to this technique [Baker and Matthews, 2004; Benhi-
mane and Malis, 2007; Brooks and Arbel, 2010; Hager and Belhumeur, 1998; Jurie
and Dhome, 2002a].
We may also allow the target to deform its shape: this deformation induces
changes in the target projected appearance. We model these changes in target
texture by using generative models such as eigenimages [Black and Jepson, 1998;
Buenaposada et al., 2009], Active Appearance Models (aam) [Cootes et al., 2001],
active blobs [Sclaroff and Isidoro, 2003], or subspace representation [Ross et al.,
2004]. Instead of modelling brightness variations we may represent target shape
deformation by using a linear model representing the location of a set of feature
points [Blanz and Vetter, 2003; Bregler et al., 2000; Del Bue et al., 2004], or Finite
Element Meshes [Pilet et al., 2005; Zhu et al., 2006]. Alternative approaches model
non-rigid motion of the target by using anthropometric data [Decarlo and Metaxas,
2000], or by using a probability distribution of the intensity values of the target
region [Comaniciu et al., 2000; Zimmermann et al., 2009].
These techniques are suitable to track planar objects of the scene. If we add fur-
ther knowledge about the scene, we can track more complex objects: with a proper
model we are able to recover 3D information. Typically, we use a wireframe 3D
model of the target and tracking consists in finding the best alignment between the
sensed image and the 3D model [Cipolla and Drummond, 1999; Kollnig and Nagel,
1997; Marchand et al., 1999]. We can augment this model by adding further texture
priors either from the image stream [Cobzas et al., 2009; Muñoz et al., 2005; Sepp
and Hirzinger, 2003; Vacchetti et al., 2004; Xiao et al., 2004a; Zimmermann et al.,
2006], or from an external source (e.g. a 3D scanner or a texture mosaic) [Hong
and Chung, 2007; La Cascia et al., 2000; Masson et al., 2004, 2005; Pressigout and
Marchand, 2007; Romdhani and Vetter, 2003].
Motion model
The motion model describes the target kinematics (i.e. how the object modifies
its position in the image/scene). The motion model is tightly coupled to the tar-
get model: it is usually represented by a geometric transformation that maps the
coordinates of the target model into a different set of coordinates. For a planar
target, these geometric transformations are typically affine [Hager and Belhumeur,
1998], homographic [Baker and Matthews, 2004; Buenaposada and Baumela, 1999],
or spline-based warps [Bartoli and Zisserman, 2004; Brunet et al., 2009; Lester and
Arridge, 1999; Masson et al., 2005]. For actual 3D targets, the geometric warps
compute the rotation and translation of the object using a 6-degree-of-freedom
(dof) rigid-body transformation [Cipolla and Drummond, 1999; La Cascia
et al., 2000; Marchand et al., 1999; Sepp and Hirzinger, 2003].
Camera model
The camera model specifies how the images are sensed by the camera. The pin-
hole camera models the imaging device as a projector of the coordinates of the
scene [Hartley and Zisserman, 2004]. For tracking zoomed objects located far away,
we may use orthographic projection [Brand and R.Bhotika, 2001; Del Bue et al.,
2004; Tomasi and Kanade, 1992; Torresani et al., 2002]. The perspective projection
accounts for perspective distortion, and it is more suitable for close-up views [Muñoz
et al., 2005, 2009]. The camera model may also account for model deviations such
as lens distortion [Claus and Fitzgibbon, 2005; Tsai, 1987].
Other model assumptions
We can also model prior photometric knowledge about the target/scene such as
illumination cues [La Cascia et al., 2000; Lagger et al., 2008; Romdhani and Vetter,
2003], or global colour [Bartoli, 2008].
2.3.2 Rigid Objects
We can follow two strategies to recover the 3D parameters of a rigid object:
2D Tracking The first group of methods involves a two-step process: first we
compute the 2D motion of the object as a displacement of the target projection
on the image; second, we recover the actual 3D parameters from the computed
2D displacements by using the scene geometry. A natural choice is to use
optical flow: [Irani et al., 1997] computes the dominant 2D parametric motion
between two frames to register the images; the residual displacement—the
image regions that cannot be registered—is used to recover the 3D motion.
When the object is a 3D plane, we can use a homographic transformation to
compute plane-to-plane correspondences between two images; then we recover
the actual 3D motion of the plane using the camera geometry [Buenaposada
and Baumela, 2002; Lourakis and Argyros, 2006; Simon et al., 2000]. We
can also compute the inter-frame displacements by using linear regressors or
predictors, and then we robustly adjust the projections to a target model—
using RANSAC—to compute the 3D parameters [Zimmermann et al., 2009]. An
alternative method is to compute pixel-to-pixel correspondences by using a
classifier [Lepetit and Fua, 2006], and then recover the target 3D pose using
POSIT [Dementhon and Davis, 1995], or equivalent methods [Lepetit et al.,
2009].
3D Tracking These methods directly compute the actual 3D motion of the object
from the image stream. They mainly use a 3D model of the target to compute
the motion parameters; the 3D model contains a priori knowledge of the target
that improves the estimation of motion parameters (e.g. to get rid of projec-
tive ambiguities). The simplest way to represent a 3D target is using a texture
model—a set of image patches sensed from one or several reference images—as
in [Cobzas et al., 2009; Devernay et al., 2006; Jurie and Dhome, 2002b; Masson
et al., 2004; Sepp and Hirzinger, 2003; Xu and Roy-Chowdhury, 2008]. The
main drawback of these methods is their lack of robustness against changes in
scene illumination and specular reflections. We can alternatively fit the projection
of a 3D wireframe model (e.g. a cad model) to the edges of the image [Drum-
mond and Cipolla, 2002]. However, these methods also have problems with
cluttered backgrounds [Lepetit and Fua, 2005]. To gain robustness, we can use
hybrid models of texture and contours such as [Marchand et al., 1999; Masson
et al., 2003; Vacchetti et al., 2004], or simply use an additional model to deal
with illumination [Romdhani and Vetter, 2003].
2.3.3 Nonrigid Objects
Tracking methods for nonrigid objects fall into the same categories as those we used
for rigid ones. Point-to-point correspondences of the deformable target can recover
the pose and/or deformation parameters using subspace methods [Del Bue, 2010;
Torresani et al., 2008], or fitting a deformable triangle mesh [Pilet et al., 2008;
Salzmann et al., 2007]. We can alternatively fit the 2D silhouette of the target to a
3D skeletal deformable model of the object [Bowden et al., 2000].
Direct estimation of the 3D parameters unifies the processes of matching pixel
correspondences, and estimating the pose and deformation of the target. [Brand,
2001; Brand and R.Bhotika, 2001] constrains the optical flow by using a linear
generative model to represent the deformation of the object. [Gay-Bellile et al.,
2010] models the object 3D deformations, including self-occlusions, by using a set
of Radial Basis Functions (rbf).
2.3.4 Facial Motion Capture
Estimation of facial motion parameters is a challenging task; head 3D orientation
was typically estimated by using fiducial markers to overcome the inherent difficulty
of the problem [Bickel et al., 2007].
However, markerless methods have also been developed in recent years. Facial
motion capture involves recovering head 3D orientation and/or face deformation due
to changes in expression. We first review techniques for recovering head 3D pose,
then we review techniques for recovering both pose and expression.
Head pose estimation There are numerous techniques to compute head pose or
3D orientation. In the following, we review a number of them—a recent detailed
survey on the subject is [Murphy-Chutorian and Trivedi, 2009]. The main difficulty
of estimating head pose lies in the nonconvex structure of the human head. Classic
2D approaches such as [Black and Yacoob, 1997; Hager and Belhumeur, 1998] are
only suitable to track motions of the head parallel to the image plane: the rea-
son is that these methods only use information from a single reference image. To
fully recover the 3D rotation parameters of the head we need additional informa-
tion. [La Cascia et al., 2000] uses a texture map that was computed by cylindrical
projection of different point-of-view images of the head; [Baker et al., 2004a; Jang
and Kanade, 2008] also use an analogous cylindrical model. In a similar fashion, we
can use a 3D ellipsoid shape [An and Chung, 2008; Basu et al., 1996; Choi and Kim,
2008; Malciu and Prêteux, 2000]. Instead of using a cylinder or an ellipsoid, we can
have a detailed model of the head like a 3D Morphable Model (3dmm) [Blanz and
Vetter, 2003; Muñoz et al., 2009; Xu and Roy-Chowdhury, 2008], an aam coupled
together with a 3dmm [Faggian et al., 2006], or a triangular mesh model of the
face [Vacchetti et al., 2004]. The latter is robustly tracked in [Strom et al., 1999]
using an Extended Kalman Filter. We can also have a head model with reduced
complexity as in [B. Tordoff et al., 2002].
Face expression estimation A change of facial expression induces a deforma-
tion in the 3D structure of the face. The estimation of this deformation can be
used for face expression recognition, expression detection, or facial motion trans-
fer. Classic 2D approaches such as aams [Cootes et al., 2001; Matthews and Baker,
2004] are only suitable to recover expressions from a frontal face. 3D aams are the
three-dimensional extension to these 2D methods: they adjust a statistical model
of 3D shapes and texture—typically a PCA model—to the pixel intensities of the
image [Chen and Wang, 2008; Dornaika and Ahlberg, 2006]. Hybrid methods that
combine 2D and 3D aams show both real-time performance and actual 3D head
pose estimation: we can use the 3D aams to simultaneously constrain the 2D aams
motion and compute the 3D pose [Xiao et al., 2004b], or directly compute the fa-
cial motion from the 2D aams parameters [Zhu et al., 2006]. In contrast to pure
2D aams, 3D aams can recover actual 3D pose and expression from faces that are
not frontal to the camera. However, the out-of-plane rotations that can be recovered
by these methods are typically smaller than those recovered using a pure 3D model (e.g. a
3dmm). [Blanz and Vetter, 2003; Romdhani and Vetter, 2003] search the best con-
figuration for a 3dmm such that the differences between the rendered model and the
image are minimal; both methods also show great performance recovering strong fa-
cial deformations. Real-time alternatives using 3dmm include [Hiwada et al., 2003;
Muñoz et al., 2009]. [Pighin et al., 1999] uses a linear combination of 3D face models
fitted to match the images to estimate realistic facial expressions. Finally, [Decarlo
and Metaxas, 2000] derives an anthropometric physically-based face model that may
be adjusted to each individual face target; besides, they solve a dynamic system for
the face pose and expression parameters by using optical flow constrained by the
edges of the face.
Chapter 3
Efficient Direct Image Registration
3.1 Introduction
This chapter reviews the problem of efficiently registering two images. We define
the Direct Image Alignment (dia) problem as the process that computes the trans-
formation between two frames using only image brightness information. We orga-
nize the chapter as follows: Section 3.2 introduces basic registration notions; Sec-
tion 3.3 reviews additive registration algorithms such as Lucas-Kanade or Hager-
Belhumeur; Section 3.4 reviews compositional registration algorithms such as Baker
and Matthews’ Forward Compositional and Inverse Compositional; finally, other
methods are reviewed in Section 3.5.
3.2 Modelling Assumptions
This section reviews those assumptions on the real world that we use to mathemat-
ically model the registration procedure. We introduce the notation on the imaging
process through a pinhole camera. We present the Brightness Constancy Assump-
tion, or Brightness Constancy Constraint (bcc), as the cornerstone of direct
image registration techniques. We also pose the registration problem as an itera-
tive optimization problem. Finally, we provide a classification of the existing direct
registration algorithms.
3.2.1 Imaging Geometry
We represent points of the scene using Cartesian coordinates in $\mathbb{R}^3$ (e.g. $\mathbf{X} = (X, Y, Z)^\top$). We represent points on the image with homogeneous coordinates, so that the pixel position $\mathbf{x} = (i, j)^\top$ is represented using the notation for augmented points as $\tilde{\mathbf{x}} = (i, j, 1)^\top$. The homogeneous point $\tilde{\mathbf{x}} = (x_1, x_2, x_3)^\top$ is conversely represented in Cartesian coordinates using the mapping $p : \mathbb{P}^2 \to \mathbb{R}^2$, such that $p(\tilde{\mathbf{x}}) = \mathbf{x} = (x_1/x_3, x_2/x_3)^\top$. The scene is imaged through a perfect pin-hole camera [Hartley and Zisserman, 2004]; by abuse of notation, we define the perspective
Figure 3.1: Imaging geometry. An object of the scene is imaged through camera
centres C1 and C2 onto two distinct images I1 and I2 (related by a rotation R and
a translation t). The point $\mathbf{X}$ is projected to the points $\mathbf{x}_1 = p(\mathsf{K}\,[\mathsf{I} \mid \mathbf{0}]\,\tilde{\mathbf{X}})$ and
$\mathbf{x}_2 = p(\mathsf{K}\,[\mathsf{R} \mid -\mathsf{R}\mathbf{t}]\,\tilde{\mathbf{X}})$ in the two images.
projection $p : \mathbb{R}^3 \to \mathbb{R}^2$ that maps scene coordinates onto image points,
$$\mathbf{x} = p(\mathbf{X}_c) = \left(\frac{\mathbf{k}_1^\top \mathbf{X}_c}{\mathbf{k}_3^\top \mathbf{X}_c}, \frac{\mathbf{k}_2^\top \mathbf{X}_c}{\mathbf{k}_3^\top \mathbf{X}_c}\right)^\top,$$
where $\mathsf{K} = (\mathbf{k}_1^\top, \mathbf{k}_2^\top, \mathbf{k}_3^\top)^\top$ is the $3 \times 3$ matrix that contains the camera intrinsics
(cf. [Hartley and Zisserman, 2004]), and $\mathbf{X}_c = (X_c, Y_c, Z_c)^\top$. We implicitly assume
that $\mathbf{X}_c$ represents a point in the camera reference system. If the points to project
are expressed in an arbitrary reference system of the scene we need an additional
mapping; hence, the perspective projection for a point $\mathbf{X}$ in the scene is
$$\tilde{\mathbf{x}} = \mathsf{K}\,[\mathsf{R} \mid -\mathsf{R}\mathbf{t}]\begin{pmatrix}\mathbf{X}\\ 1\end{pmatrix},$$
where R and t are the rotation and translation between the scene and the camera
coordinate system (see Figure 3.1). Our input is a smooth sequence of images—i. e.
inter-frame differences are small—where It is the t-th frame of the sequence. We de-
note T as the reference image or template. Images are discrete matrices of brightness
values, although we represent them as functions from $\mathbb{R}^2$ to $\mathbb{R}^C$, where $C$ is the number
of image channels (i.e. $C = 3$ for colour, and $C = 1$ for grey-scale images): $I_t(\mathbf{x})$ is
the brightness value at pixel $\mathbf{x}$. For non-discrete pixel coordinates, we use bilinear
interpolation. If $X$ is a set of pixels, we collect the brightness values of $I(\mathbf{x}), \forall \mathbf{x} \in X$ in
a single column vector as $I(X)$—i.e., $I(X) = (I(\mathbf{x}_1), \ldots, I(\mathbf{x}_N))^\top$, $\mathbf{x}_i \in X$.
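The projection equations above can be condensed into a short NumPy sketch; the function name and argument layout are our own illustrative choices (points expressed in an arbitrary scene reference system, mapped through K, R, t):

```python
import numpy as np

def project(K, R, t, X):
    """Perspective projection x = p(K [R | -Rt] (X^T, 1)^T) for scene points X (Nx3).
    Returns Cartesian pixel coordinates (Nx2)."""
    Xc = R @ (X.T - t.reshape(3, 1))   # [R | -Rt] (X, 1)^T  =  R (X - t)
    x_h = K @ Xc                       # homogeneous image coordinates
    return (x_h[:2] / x_h[2]).T        # p(.): divide by the third coordinate
```

A point on the optical axis lands on the principal point, and the focal length in K scales lateral displacements before they are offset by the principal point.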
3.2.2 Brightness Constancy Constraint
The bcc relates brightness information between two frames of a sequence [Hager
and Belhumeur, 1998; Irani and Anandan, 1999]. The reference image T is one
arbitrary image of the sequence. We define the target region X as a set of pixel
coordinates X = {x1, . . . , xN} defined on T (see Figure 3.2). We define the template
as the image values of the target region, that is, T (X). Let us assume we know
the transformation of the target region between T and another arbitrary image of
the sequence, It. The motion model f defines this transformation as Xt = f(X; µt),
where the set of coordinates Xt is the target region on It and µt are the motion
parameters. The bcc states that the brightness values of the template T and the
input image $I_t$ warped by $f$ with parameters $\mu_t$ should be equal,
$$T(X) = I_t(f(X; \mu_t)). \qquad (3.1)$$
The direct conclusion from Equation 3.1 is that the brightness of the target does not
depend on its motion—i.e., the relative position and orientation of the camera with
respect to the target does not affect the brightness of the latter. However, we may aug-
ment the bcc to include appearance changes [Black and Jepson, 1998; Buenaposada
et al., 2009; Matthews and Baker, 2004], and changes in illumination conditions due
to ambient [Bartoli, 2008; Basri and Jacobs, 2003] or specular lighting [Blanz and
Vetter, 2003].
3.2.3 Image Registration by Optimization
Direct image registration is usually posed as an optimization problem. We minimize
an error function based on the brightness pixel-wise difference that is parameterized
by motion variables:
$$\mu^\ast = \arg\min_{\mu} \{D(X; \mu)^2\}, \qquad (3.2)$$
where
$$D(X; \mu) = T(X) - I_t(f(X; \mu)) \qquad (3.3)$$
is a dissimilarity measure based on the bcc (Equation 3.1).
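The bcc-based dissimilarity of Equations 3.2 and 3.3 can be sketched for the simplest warp, a pure 2D translation $f(\mathbf{x}; \mu) = \mathbf{x} + \mu$; the function names and the translation warp are our illustrative choices, and bilinear interpolation handles the non-discrete warped coordinates as stated above:

```python
import numpy as np

def bilinear(I, pts):
    """Sample image I at fractional (x, y) locations (Nx2 array) by bilinear interpolation."""
    x, y = pts[:, 0], pts[:, 1]
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * I[y0, x0] + ax * (1 - ay) * I[y0, x0 + 1]
            + (1 - ax) * ay * I[y0 + 1, x0] + ax * ay * I[y0 + 1, x0 + 1])

def dissimilarity(T, I, X, mu):
    """D(X; mu) = T(X) - I(f(X; mu)) for the translation warp f(x; mu) = x + mu."""
    return bilinear(T, X) - bilinear(I, X + mu)
```

When mu equals the true displacement of the target region, the residual vector is identically zero, which is exactly the statement of the bcc.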
Descent Methods
Recovering these parameters is typically a non-linear problem as it depends on
image brightness—which is usually non-linearly related to the motion parameters.
The usual approach is iterative gradient-based descent (GD): from a starting point
µ0 in the search space, the method iteratively computes a series of partial solu-
tions $\mu_1, \mu_2, \ldots, \mu_k$ that, under certain conditions, converge to the local minimizer
$\mu^\ast$ [Madsen et al., 2004] (see Figure 3.2). We typically use Gauss-Newton (GN)
methods for efficient registration because they provide good convergence without
computing second derivatives (see Appendix A). Hence, the basic GN-based algo-
rithm for image registration operates as we outline in Algorithm 1 and depict in
Figure 3.3. We describe the four stages of the algorithm in the following:
Figure 3.2: Iterative gradient descent image registration. Top-left Template
image for the registration. We highlight the target region as a green quadrangle. Top-
right Image that we register against the template. We generate the image by rotating the
image around its centre and translating it in the X-axis. We highlight the corresponding
target region in yellow. We also display the initial guess for the optimization as a green
quadrangle. Notice that it exactly corresponds to the position of the target region at the
template. Bottom-left Contour plot of the image brightness dissimilarity. The axes show
the values of the search space: image rotation and translation. We show the successive
iterations in the search space: we reach the solution in four steps—µ0 to µ4. Bottom-
right We show the target region that corresponds to the parameters of each iteration.
The colour of each quadrangle matches the colour of the parameters that generated it as
seen in the Bottom-left figure.
Dissimilarity measure The dissimilarity measure is a function of the image bright-
ness error between two images. The usual measure for image registration is
the Sum of Squared Differences (ssd), that is, the $L_2$-norm of the difference
of pixel brightness (Equation 3.3) [Brooks and Arbel, 2010; Hager and Bel-
humeur, 1998; Irani and Anandan, 1999; Lucas and Kanade, 1981]. However,
we can use other measures such as normalized cross-correlation [Brooks and
Arbel, 2010; Lewis, 1995], or mutual information [Brooks and Arbel, 2010;
Dowson and Bowden, 2008; Viola and Wells, 1997].
Linearize the dissimilarity The next stage linearizes the brightness function about
the current search parameters µ; this linearization enables us to transform
the problem into a system of linear equations on the search variables. We
typically approximate the function using Taylor series expansion; depending
on how many terms—derivatives—we compute, we have optimisation methods
like Gradient Descent [Amberg and Vetter, 2009], Newton-Raphson [Lucas and
Kanade, 1981; Shi and Tomasi, 1994], Gauss-Newton [Baker and Matthews,
2004; Brooks and Arbel, 2010; Hager and Belhumeur, 1998] or even higher-
order methods [Benhimane and Malis, 2007; Keller and Averbuch, 2004, 2008;
Megret et al., 2008]. This is theoretically a good approximation when the dis-
similarity is small [Irani and Anandan, 1999], although the estimation can be
improved by using coarse-to-fine iterative methods [Irani and Anandan, 1999],
or by selecting appropriate pixels [Benhimane et al., 2007]. Although Taylor
series expansion is the usual approach to compute the coefficients of the sys-
tem, other approaches such as linear regression [Cootes et al., 2001; Jurie and
Dhome, 2002a] or numeric differentiation [Gleicher, 1997] may be used.
Compute the descent direction The descent direction is a vector δµ in the
search space such that D(µ+δµ) < D(µ). In a GN-based algorithm, we solve
the linear system of equations of the previous stage using least-squares [Baker
and Matthews, 2004; Madsen et al., 2004]. Note that we do not perform the
line search stage—i.e., we implicitly assume that the step size α = 1, cf.
Appendix A.
Update the search parameters Once we have determined the search direction,
$\delta\mu$, we compute the next point in the series by using the update function
$U : \mathbb{R}^P \to \mathbb{R}^P$: $\mu_1 = U(\mu_0, \delta\mu)$. We compute the dissimilarity value at $\mu_1$ to
check convergence: if the dissimilarity is below a given threshold, then $\mu_1$ is the
minimizer $\mu^\ast$—i.e., $\mu^\ast = \mu_1$; otherwise, we repeat the whole process (i.e.
$\mu_1$ becomes the current parameters $\mu$) until we find a suitable minimizer.
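The four stages above can be condensed into a generic GN loop with pluggable residual, linearization, and update rules; a schematic NumPy sketch with our own function names (the update callable may implement either an additive or a compositional rule):

```python
import numpy as np

def gn_register(residual, jacobian, update, mu0, tol=1e-6, max_iters=50):
    """Generic GN-based descent loop for registration (cf. Algorithm 1)."""
    mu = np.asarray(mu0, dtype=float)
    for _ in range(max_iters):
        r = residual(mu)                              # stage 1: dissimilarity at mu
        if np.dot(r, r) < tol:                        # convergence check
            break
        J = jacobian(mu)                              # stage 2: linearization
        dmu, *_ = np.linalg.lstsq(J, -r, rcond=None)  # stage 3: descent direction
        mu = update(mu, dmu)                          # stage 4: parameter update
    return mu
```

The least-squares solve replaces an explicit line search; as noted above, the step size is implicitly fixed at $\alpha = 1$.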
3.2.4 Additive vs. Compositional
We turn our attention to step 4 of Algorithm 1: how to compute the new es-
timation of the optimization parameters. In a GN optimization scheme, the new
Algorithm 1 Outline of the basic GN-based descent method for image
registration
On-line: Let µi = µ0 be the initial guess.
1: while no convergence do
2: Compute the dissimilarity function at D(µi).
3: Compute the search direction: linearize the dissimilarity and compute the
descent direction, δµi.
4: Update the optimization parameters: µi+1 = U(µi, δµi).
5: end while
Figure 3.3: Generic descent method for image registration. We initialize the
current parameter estimation at frame It+1 (µ = µ0) using the local minimizer at the
previous frame It (µ0 = µ∗t). We compute the Dissimilarity Measure between the Im-
age and the Template using µ (Equation 3.3). We linearize the dissimilarity measure
to compute the descent direction of the search parameters (δµ). We update the search
parameters using the search direction and we obtain an approximation to the minimum
(µ1). We check if µ1 is a local minimizer by using the brightness dissimilarity: if D is
small enough, then µ1 is the local minimizer (µ∗ = µ1); otherwise, we repeat the
process using µ1 as the current parameter estimation (µ = µ1).
parameters are typically computed by adding the former optimization parameters
to the search direction vector: µt+1 = µt + δµt (cf. Appendix A); this summation
is a direct consequence of the definition of Taylor series [Madsen et al., 2004]. We
call additive approaches to those methods that update parameters by using addi-
tion [Hager and Belhumeur, 1998; Irani and Anandan, 1999; Lucas and Kanade,
1981]. Nonetheless, Baker and Matthews [Baker and Matthews, 2004] subsequently
proposed a GN-based method that updated the parameters using composition—
i.e., µt+1 = µt ◦ δµt. We call these methods compositional approaches [Baker and
Matthews, 2004; Cobzas et al., 2009; Muñoz et al., 2005; Romdhani and Vetter,
2003; Xu and Roy-Chowdhury, 2008].
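The two update rules can be contrasted with a toy example. For a homography parameterized by its $3 \times 3$ matrix, composition is a matrix product; the function names are our own illustrative choices:

```python
import numpy as np

def additive_update(mu, dmu):
    """Additive rule: parameters live in a vector space, mu_{t+1} = mu_t + dmu_t."""
    return mu + dmu

def compositional_update(H, dH):
    """Compositional rule: each step is itself a warp, and warps chain by
    function composition; for homographies this is a matrix product."""
    H_new = H @ dH
    return H_new / H_new[2, 2]   # fix the free scale of the homogeneous matrix
```

For pure translations the two rules coincide (composing two translations sums their offsets), but for general warps the compositional update keeps the incremental warp expressed in the template frame, which is what the efficient compositional algorithms exploit.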
3.3 Additive approaches
In this section we review some works that use additive update. We introduce the
Lucas-Kanade algorithm, the fundamental work on direct image registration. We
show the basic algorithm as well as the common problems regarding the method. We
also introduce the Hager-Belhumeur approach to image registration and we point
out its highlights.
3.3.1 Lucas-Kanade Algorithm
The Lucas-Kanade (LK) algorithm [Lucas and Kanade, 1981] solves the registration
problem using a GN optimization scheme. The algorithm defines the residuals $\mathbf{r}$ of
Equation 3.3 as
$$\mathbf{r}(\mu) \equiv T(\mathbf{x}) - I(f(\mathbf{x}; \mu)). \qquad (3.4)$$
The corresponding linear model for these residuals is
$$\mathbf{r}(\mu + \delta\mu) \simeq \ell(\delta\mu) \equiv \mathbf{r}(\mu) + \mathbf{r}'(\mu)\,\delta\mu = \mathbf{r}(\mu) + \mathsf{J}(\mu)\,\delta\mu, \qquad (3.5)$$
where
$$\mathbf{r}(\mu) \equiv T(\mathbf{x}) - I(f(\mathbf{x}; \mu)), \quad \text{and} \quad \mathsf{J}(\mu) \equiv \left.\frac{\partial I(f(\mathbf{x}; \hat{\mu}))}{\partial \hat{\mu}}\right|_{\hat{\mu} = \mu}. \qquad (3.6)$$
Hence, our optimization process now amounts to minimizing
$$\delta\mu^\ast = \arg\min_{\delta\mu} \{\ell(\delta\mu)^\top \ell(\delta\mu)\} = \arg\min_{\delta\mu} \{L(\delta\mu)\}. \qquad (3.7)$$
We compute the local minimizer of $L(\delta\mu)$ as follows:
$$0 = L'(\delta\mu) = \nabla_{\delta\mu}\left(\mathbf{r}(\mu)^\top\mathbf{r}(\mu) + 2\,\delta\mu^\top\mathsf{J}(\mu)^\top\mathbf{r}(\mu) + \delta\mu^\top\mathsf{J}(\mu)^\top\mathsf{J}(\mu)\,\delta\mu\right) = 2\left(\mathsf{J}(\mu)^\top\mathbf{r}(\mu) + \mathsf{J}(\mu)^\top\mathsf{J}(\mu)\,\delta\mu\right). \qquad (3.8)$$
Again, we obtain an approximation to the local minimum at
$$\delta\mu = -\left(\mathsf{J}(\mu)^\top\mathsf{J}(\mu)\right)^{-1}\mathsf{J}(\mu)^\top\mathbf{r}(\mu), \qquad (3.9)$$
which we iteratively refine until we find a suitable solution. We summarize the
optimization process in Algorithm 2 and Figure 3.4.
Algorithm 2 Outline of the Lucas-Kanade algorithm.
On-line: Let µi = µ0 be the initial guess.
1: while no convergence do
2: Compute the residual function at r(µi) from Equation 3.4.
3: Linearize the dissimilarity: J = ∇µr(µi)
4: Compute the search direction: $\delta\mu_i = -\left(\mathsf{J}(\mu_i)^\top \mathsf{J}(\mu_i)\right)^{-1} \mathsf{J}(\mu_i)^\top \mathbf{r}(\mu_i)$.
5: Update the optimization parameters: $\mu_{i+1} = \mu_i + \delta\mu_i$.
6: end while
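As a worked instance of Algorithm 2, a minimal NumPy sketch of LK specialized to the pure-translation warp $f(\mathbf{x}; \mu) = \mathbf{x} + \mu$; the restriction to translations and the function names are our own simplifications (the algorithm handles general warps):

```python
import numpy as np

def lucas_kanade_translation(T, I, X, mu0, iters=20):
    """Algorithm 2 for the translation warp f(x; mu) = x + mu.
    T, I: greyscale images; X: Nx2 integer pixel coordinates (x, y) of the
    target region, assumed to stay away from the image borders."""
    mu = np.asarray(mu0, dtype=float)
    Tx = T[X[:, 1], X[:, 0]]                          # template values T(X)
    for _ in range(iters):
        pts = X + mu
        x0 = np.floor(pts[:, 0]).astype(int)
        y0 = np.floor(pts[:, 1]).astype(int)
        ax, ay = pts[:, 0] - x0, pts[:, 1] - y0
        # Warped image values I(f(x; mu)) by bilinear interpolation.
        Iw = ((1 - ax) * (1 - ay) * I[y0, x0] + ax * (1 - ay) * I[y0, x0 + 1]
              + (1 - ax) * ay * I[y0 + 1, x0] + ax * ay * I[y0 + 1, x0 + 1])
        r = Tx - Iw                                   # residuals, Equation 3.4
        # Image gradients at the warped points (central differences on I),
        # recomputed at every iteration -- the cost the HB algorithm removes.
        Ix = 0.5 * (I[y0, x0 + 1] - I[y0, x0 - 1])
        Iy = 0.5 * (I[y0 + 1, x0] - I[y0 - 1, x0])
        J = -np.column_stack([Ix, Iy])                # dr/dmu = -dI(f)/dmu
        dmu, *_ = np.linalg.lstsq(J, -r, rcond=None)  # GN step, Equation 3.9
        mu = mu + dmu                                 # additive update
    return mu
```

Note that the Jacobian is rebuilt inside the loop from the current frame, which is precisely the per-iteration cost discussed under Known Issues below.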
Figure 3.4: Lucas-Kanade image registration. We initialize the current parameter
estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It
(µ0 ≡ µ∗t). We compute the dissimilarity residuals between the Image and the Template
using µ (Equation 3.4). We linearize the residuals at the current parameters µ, and
we compute the descent direction of the search parameters (δµ). We additively update
the search parameters using the search direction and we obtain an approximation to the
minimum—i.e. µ1 = µ0 + δµ. We check if µ1 is a local minimizer by using the brightness
dissimilarity: if D is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); otherwise,
we repeat the process using µ1 as the current parameter estimation (µ ≡ µ1).
Known Issues
The LK algorithm is one instance of a well-known technique for object tracking [Baker
and Matthews, 2004]. The most remarkable feature of this algorithm is its robust-
ness: given a suitable bcc, the LK algorithm typically ensures good convergence.
However, the algorithm has a series of weaknesses that degrade the overall perfor-
mance of the tracking:
Computational Cost The LK algorithm computes the Jacobian at each iteration
of the optimization loop. Furthermore, the minimization cycle is repeated
between each two consecutive frames of the video sequence. The consequence
is that the Jacobian is computed F × L times, where F is the number of frames
and L is the number of iterations in the optimization loop. The computational
burden of these operations is really high if the Jacobian is large: we have to
compute the derivatives at each point of the target region, and each point
contributes to a row in the Jacobian. As an example, Table 7.15—page 106—
compares the computational complexity of LK algorithm with respect to other
efficient methods.
Local Minima The GN optimization scheme, which is the basis for the LK al-
gorithm, is prone to get trapped in local minima. The very essence of the
minimization implies that the algorithm converges to the closest minimum
to the starting point. So, we must choose the initial guess of the optimiza-
tion very carefully to assure convergence to the true optimum. The best way
to guarantee that the starting point for tracking and the optimum are close
enough is imposing that the differences between consecutive images are small.
On the contrary, images with large baseline will cause problems to LK as falling
into local minima is more likely, which leads to incorrect alignment. To solve
this problem, common to all direct approaches, a pyramidal implementation
of the optimization may be used [Bouguet, 2000].
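The pyramidal, coarse-to-fine strategy mentioned above can be sketched as follows; the block-averaging pyramid, the doubling of the translation estimate between levels, and the function names are our illustrative choices, not the exact implementation of [Bouguet, 2000]:

```python
import numpy as np

def halve(I):
    """One pyramid level: 2x2 block averaging (a crude low-pass + subsampling)."""
    h, w = I.shape[0] // 2 * 2, I.shape[1] // 2 * 2
    return 0.25 * (I[0:h:2, 0:w:2] + I[1:h:2, 0:w:2]
                   + I[0:h:2, 1:w:2] + I[1:h:2, 1:w:2])

def coarse_to_fine(T, I, levels, register):
    """Estimate a translation coarse-to-fine: register at the coarsest level
    first, then double the estimate and refine it at each finer level."""
    pyramid = [(T, I)]
    for _ in range(levels - 1):
        Tl, Il = pyramid[-1]
        pyramid.append((halve(Tl), halve(Il)))
    mu = np.zeros(2)
    for i, (Tl, Il) in enumerate(reversed(pyramid)):
        if i > 0:
            mu = 2.0 * mu          # a shift of s pixels at level k is 2s at level k-1
        mu = register(Tl, Il, mu)  # any translation estimator can be plugged in
    return mu
```

Large displacements become small ones at the coarsest level, so each per-level optimization starts inside the basin of attraction of the correct minimum.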
3.3.2 Hager-Belhumeur Factorization Algorithm
We now review an efficient algorithm for determining the motion parameters of the
target. The algorithm is similar to LK, but uses a priori information about the tar-
get motion and structure to save computation time. The Hager-Belhumeur (HB) or
factorization algorithm was first proposed by G. Hager and P. Belhumeur in [Hager
and Belhumeur, 1998]. The authors noticed the high computational cost when lin-
earizing the brightness error function in the LK algorithm: the dissimilarity depends
on each different frame of the sequence, It. The method focuses on how to efficiently
compute the Jacobian matrix of step 3 of the LK algorithm (see Algorithm 2). The
computation of the Jacobian in the HB algorithm has two separate stages:
1. Gradient replacement
The key idea is to use the derivatives at the template T instead of computing
the derivatives at frame It when estimating J. Hager and Belhumeur dealt with
this issue in a very neat way: they noticed that, if the bcc (Equation 3.1) re-
lated image and template brightness values, it could possibly relate also image
and template derivatives—cf. [Hager and Belhumeur, 1998]. The derivatives
of both sides of Equation 3.1 with respect to the target region coordinates are
$$\nabla_{\mathbf{x}} T(\mathbf{x}) = \nabla_{\mathbf{x}} I_t(f(\mathbf{x}; \mu_t)) = \nabla_{\mathbf{x}} I_t(\mathbf{x})\,\nabla_{\mathbf{x}} f(\mathbf{x}; \mu_t), \quad \mathbf{x} \in X. \qquad (3.10)$$
On the other hand, we compute the Jacobian as
$$\mathsf{J} = \nabla_{\mu_t} I_t(f(\mathbf{x}; \mu_t)) = \nabla_{\mathbf{x}} I_t(\mathbf{x})\,\nabla_{\mu_t} f(\mathbf{x}; \mu_t). \qquad (3.11)$$
We isolate the term $\nabla_{\mathbf{x}} I_t(\mathbf{x})$ in Equations 3.10 and 3.11, and we equate the
remaining terms as follows:
$$\mathsf{J} = \nabla_{\mathbf{x}} T(\mathbf{x})\,\nabla_{\mathbf{x}} f(\mathbf{x}; \mu_t)^{-1}\,\nabla_{\mu_t} f(\mathbf{x}; \mu_t). \qquad (3.12)$$
Notice that in Equation 3.12 the Jacobian depends on the template derivatives,
$\nabla_{\mathbf{x}} T(\mathbf{x})$, which are constant. Using template derivatives speeds up the whole
process up to 10-fold (cf. Table 7.16—page 106).
2. Factorization
Equation 3.12 reveals the internal structure of the Jacobian: it comprises
the product of three matrices: a matrix $\nabla_{\mathbf{x}} T(\mathbf{x})$ that depends on template
brightness values, and two matrices, $\nabla_{\mathbf{x}} f(\mathbf{x}; \mu_t)^{-1}$ and $\nabla_{\mu_t} f(\mathbf{x}; \mu_t)$, whose values
depend on both the target shape coordinates and the motion parameters $\mu_t$.
The factorization stage re-arranges the Jacobian internal structure so that
we speed up the computation of this matrix product.
A word about factorization In the literature, matrix factorization or ma-
trix decomposition refers to the process that expresses the values of a
matrix as the product of matrices of special types. One major example is to factorize a matrix A into the product of a lower triangular matrix L and an upper triangular matrix U, A = LU. This factorization is called lu decomposition and it allows us to solve the linear system Ax = b more efficiently: solving Ux = L⁻¹b requires fewer additions and multiplications than the original system [Golub and Van Loan, 1996].
Other famous examples of matrix factorization are spectral decomposi-
tion, Cholesky factorization, Singular Value Decomposition (svd) and
qr factorization (see [Golub and Van Loan, 1996] for more information).
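As a concrete illustration of the factor-once, solve-many idea, here is a minimal sketch in Python. The Doolittle elimination without pivoting is an assumption made for this deliberately diagonally dominant toy matrix; a production solver would pivot.

```python
import numpy as np

def lu_decompose(A):
    """Doolittle LU factorization without pivoting (adequate for this
    diagonally dominant toy matrix): returns unit lower L and upper U."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

def lu_solve(L, U, b):
    # Forward substitution Ly = b, then back substitution Ux = y.
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
L, U = lu_decompose(A)                          # factor once (O(n^3)) ...
b = np.array([1.0, 2.0, 3.0])
x = lu_solve(L, U, b)                           # ... each solve is only O(n^2)
```

Once L and U are stored, any number of right-hand sides can be solved by two triangular substitutions, which is precisely the saving exploited above.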
The key concept of factorization in this problem is stated as follows:
Given a matrix product whose operands contain both constant and
variable terms, we want to re-arrange the product such that one
operand contains only constant values and the other one only con-
tains variable terms.
We rewrite this idea in equation form as follows:

J = ∇xT(x) ∇xf(x; µ)⁻¹ ∇µt f(x; µ) = S(x)M(µ), (3.13)
where S(x) contains only target coordinate values and M(µ) contains only motion parameters. The process to decompose the matrix J into the product S(x)M(µ) is generally ad hoc: we must gain insight into the analytic structure of the matrices ∇xf(x; µ)⁻¹ and ∇µt f(x; µ) to re-arrange their entries into S(x)M(µ) [Hager and Belhumeur, 1998]. This process is not obvious at all, and it has been a frequent source of criticism of the HB algorithm [Baker and Matthews, 2004]. However, we shall introduce procedures for systematic factorization in Chapter 5.
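To make the factorization idea concrete, the sketch below re-arranges the per-pixel Jacobian rows g(x)ᵀ A⁻¹ B(x) of a hypothetical 2D affine warp f(x; µ) = Ax + t, with µ = (a, b, c, d, tx, ty) and A = [[1+a, b], [c, 1+d]], into a constant S(x) and a motion-only M(µ). The particular ordering of S(x) and the placement of A⁻¹ inside M(µ) are one possible re-arrangement chosen for this toy example, not the factorization derived by Hager and Belhumeur for their warps.

```python
import numpy as np

def jacobian_direct(g, Ainv, x):
    # Per-pixel HB Jacobian row g^T A^{-1} B(x); B(x) = df/dmu for the affine warp.
    x1, x2 = x
    B = np.array([[x1, x2, 0, 0, 1, 0],
                  [0, 0, x1, x2, 0, 1]], float)
    return g @ Ainv @ B                     # recomputed per frame in the naive scheme

def S_row(g, x):
    # Constant per-pixel row: products of template gradient and coordinates.
    g1, g2 = g
    x1, x2 = x
    return np.array([g1 * x1, g2 * x1, g1 * x2, g2 * x2, g1, g2])

def M_of_mu(Ainv):
    # Motion-only matrix: copies of A^{-1} placed so that S(x) M(mu)
    # reproduces g^T A^{-1} B(x) exactly.
    M = np.zeros((6, 6))
    M[np.ix_([0, 1], [0, 2])] = Ainv
    M[np.ix_([2, 3], [1, 3])] = Ainv
    M[np.ix_([4, 5], [4, 5])] = Ainv
    return M

mu = np.array([0.05, -0.02, 0.01, 0.03, 2.0, -1.0])
A = np.array([[1 + mu[0], mu[1]], [mu[2], 1 + mu[3]]])
Ainv = np.linalg.inv(A)

pixels = [np.array([3.0, 7.0]), np.array([-2.0, 4.0])]   # toy target coordinates
grads = [np.array([0.8, -0.3]), np.array([0.1, 0.6])]    # toy template gradients

S = np.vstack([S_row(g, x) for g, x in zip(grads, pixels)])  # constant, off-line
J_fact = S @ M_of_mu(Ainv)                                   # cheap, per frame
J_dir = np.vstack([jacobian_direct(g, Ainv, x)
                   for g, x in zip(grads, pixels)])
```

The on-line cost drops to one small matrix product S(x)M(µ) per frame, since S(x) is computed once off-line.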
We outline the basic HB optimization in Algorithm 3; notice that the only difference with the LK algorithm lies in the Jacobian computation. We depict the differences more clearly in Figure 3.5: in the dissimilarity linearization stage we use the derivatives of the template instead of those of the frame.
Algorithm 3 Outline of the Hager-Belhumeur algorithm.
Off-line: Let µi = µ0 be the initial guess.
1: Compute S(x)
On-line:
2: while no convergence do
3: Compute the residual function r(µi) from Equation 3.4.
4: Compute the matrix M(µi)
5: Compute the Jacobian: J(µi) = S(x)M(µi)
6: Compute the search direction: δµi = −(J(µi)⊤J(µi))⁻¹ J(µi)⊤ r(µi).
7: Update the optimization parameters: µi+1 = µi + δµi.
8: end while
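Step 6 above is a standard Gauss-Newton step. A minimal sketch, with a random stand-in Jacobian and residual (names and sizes are illustrative): in practice the normal equations are solved with a least-squares routine rather than by forming the explicit inverse.

```python
import numpy as np

# One Gauss-Newton step: delta_mu = -(J^T J)^{-1} J^T r.
rng = np.random.default_rng(1)
J = rng.standard_normal((100, 6))   # stand-in Jacobian: 100 pixels, 6 parameters
r = rng.standard_normal(100)        # stand-in residual vector

# Numerically preferable: solve the least-squares problem directly ...
delta_mu = -np.linalg.lstsq(J, r, rcond=None)[0]
# ... which agrees with the explicit normal-equations formula.
delta_mu_normal = -np.linalg.solve(J.T @ J, J.T @ r)
```

Both expressions give the same search direction; the least-squares form is better conditioned when JᵀJ is nearly singular.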
3.4 Compositional approaches
From Section 3.2.4 we recall the definition of compositional method: a GN-like
optimization method that updates the search parameters using function composition.
We review two compositional algorithms: the Forward Compositional (FC) and the
Inverse Compositional (IC), [Baker and Matthews, 2004].
A word about composition Function composition is usually defined as the application of one function to the results of another. Let f : X → Y and g : Y → Z be two functions. We define the composite function g ◦ f : X → Z as (g ◦ f)(x) = g(f(x)). In the literature on image
registration the problem is posed as follows: Let f : R² → R² be the target motion model parameterized by µ. We compose the target motion as
Figure 3.5: Hager-Belhumeur image registration. We initialize the current parameter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It (µ0 ≡ µ∗t). We additionally create the matrix S(x), whose entries depend on the target values. We compute the dissimilarity residuals between the image and the template using µ (Equation 3.4). Instead of linearizing the residuals, we compute the Jacobian matrix at µ using Equation 3.12, and we solve for the descent direction using Equation 3.9. We additively update the search parameters using the search direction and obtain an approximation to the minimum, i.e. µ1 = µ0 + δµ. We check whether µ1 is a local minimizer by using the brightness dissimilarity: if D is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); otherwise, we repeat the process using µ1 as the current parameter estimation (µ ≡ µ1).
z = f(f(x; µ1); µ2) = f(x; µ1 ◦ µ2) ≡ f(x; µ3), that is, the coordinates z are the result of mapping x onto y = f(x; µ1) and y onto z = f(y; µ2). We represent the composite parameters as µ3 = µ1 ◦ µ2 such that z = f(x; µ3).
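For a warp family that is closed under composition, the composite parameters µ3 = µ1 ◦ µ2 can be computed explicitly. A sketch for a hypothetical affine warp f(x; µ) = Ax + t, where the parameter pair (A, t) and the helper names are assumptions of this example:

```python
import numpy as np

def warp(x, mu):
    # Affine warp f(x; mu) = A x + t, with mu = (A, t).
    A, t = mu
    return A @ x + t

def compose(mu1, mu2):
    # Parameters of mu1 ∘ mu2, following the text's convention
    # z = f(f(x; mu1); mu2):  A2 (A1 x + t1) + t2 = (A2 A1) x + (A2 t1 + t2).
    A1, t1 = mu1
    A2, t2 = mu2
    return (A2 @ A1, A2 @ t1 + t2)

mu1 = (np.array([[1.1, 0.0], [0.2, 0.9]]), np.array([1.0, -2.0]))
mu2 = (np.array([[0.95, 0.1], [-0.1, 1.05]]), np.array([0.5, 0.5]))
mu3 = compose(mu1, mu2)

x = np.array([3.0, 4.0])
z_chained = warp(warp(x, mu1), mu2)   # apply the two warps in sequence
z_composed = warp(x, mu3)             # apply the single composite warp
```

The two evaluations agree exactly, which is what makes the compositional parameter update well defined for such warps.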
3.4.1 Forward Compositional Algorithm
The FC algorithm was first proposed in [Shum and Szeliski, 2000], although the terminology was introduced in [Baker and Matthews, 2001]: FC is an optimization algorithm, equivalent to the LK approach, that relies on a compositional update step. Compositional algorithms for image registration use a brightness dissimilarity function slightly different from Equation 3.3; we pose the image registration problem as the following optimization:
µ∗ = arg minµ {D(X; µ)²}, (3.14)
with
D(X; µ) = T (X) − It+1(f(f(X; µ); µt)), (3.15)
where µt comprises the optimal parameters at the image It. Note that our search
variables µ are those parameters that should be composed with the current estima-
tion to yield the minimum. The residuals corresponding to Equation 3.15 are
r(µ) ≡ T(x) − It+1(f(f(x; µ); µt)). (3.16)
As in the LK algorithm, we compute the linear model of the residuals, but now at
the point µ = 0 in the search space:
r(0 + δµ) ≃ ℓ(δµ) ≡ r(0) + r′(0)δµ = r(0) + J(0)δµ, (3.17)

where

r(0) ≡ T(x) − It+1(f(f(x; 0); µt)), and J(0) ≡ ∂It+1(f(f(x; µ̂); µt))/∂µ̂ |µ̂=0. (3.18)
Notice that, in this case, µt acts as a constant in the derivative. Again, the local
minimizer is
δµ = −(J(0)⊤J(0))⁻¹ J(0)⊤ r(0). (3.19)
We iterate the above procedure until convergence. The next point in the iterative series is not computed as µt+1 = µt + δµ, but as µt+1 = µt ◦ δµ, to be coherent with Equation 3.16. Also notice that the Jacobian J(0) (Equation 3.18) is not constant, as it depends both on the image It+1 and on the parameters µt. Figure 3.6 shows a
graphical depiction of the algorithm that is outlined in Algorithm 4.
Algorithm 4 Outline of the Forward Compositional algorithm.
On-line: Let µi = µ0 be the initial guess.
1: while no convergence do
2: Compute the residual function r(µi) from Equation 3.16.
3: Linearize the dissimilarity: J = ∇µ̂ r(0), using Equation 3.18.
4: Compute the search direction: δµi = −(J(0)⊤J(0))⁻¹ J(0)⊤ r(0).
5: Update the optimization parameters: µi+1 = µi ◦ δµi.
6: end while
Figure 3.6: Forward compositional image registration. We initialize the current parameter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It (µ0 ≡ µ∗t). We compute the dissimilarity residuals between the image and the template using µ (Equation 3.15). We linearize the residuals at µ = 0 and we compute the descent direction δµ using Equation 3.19. We update the parameters using function composition, i.e. µ1 = µ0 ◦ δµ. We check whether µ1 is a local minimizer by using the brightness dissimilarity: if D (Equation 3.15) is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); otherwise, we repeat the process using µ1 as the current parameter estimation (µ ≡ µ1).
3.4.2 Inverse Compositional Algorithm
The IC algorithm reinterprets the FC optimization scheme by changing the roles of the template and the image. The key feature of IC is that its GN Jacobian is constant: we compute the Jacobian using only template brightness values, which do not change during the optimization. Using a constant Jacobian speeds up the whole computation, as the linearization stage is the most time-consuming. The IC algorithm receives its name because we reverse the roles of the template and the current frame (i.e. we compute the Jacobian on the template). We rewrite the residual function from FC (Equation 3.16) as follows:
r(µ) ≡ T(f(x; µ)) − It+1(f(x; µt)), (3.20)
yielding the residuals for IC. Notice that the template brightness values now depend on the search parameters µ. We linearize Equation 3.20 around the point µ = 0 in the search space:
r(0 + δµ) ≃ ℓ(δµ) ≡ r(0) + r′(0)δµ = r(0) + J(0)δµ, (3.21)

where

r(0) ≡ T(f(x; 0)) − It+1(f(x; µt)), and J(0) ≡ ∂T(f(x; µ̂))/∂µ̂ |µ̂=0. (3.22)
We compute the local minimizer of Equation 3.7 by differentiating it with respect to δµ and setting the result equal to zero,

0 = L′(δµ) = ∇δµ (r(0)⊤r(0) + 2δµ⊤J(0)⊤r(0) + δµ⊤J(0)⊤J(0)δµ)
  = J(0)⊤r(0) + J(0)⊤J(0)δµ. (3.23)
Again, we obtain an approximation to the local minimum at

δµ = −(J(0)⊤J(0))⁻¹ J(0)⊤ r(0), (3.24)
which we iteratively refine until we find a suitable solution. We summarize the
optimization process in Algorithm 5 and Figure 3.7.
Note that the Jacobian matrix J(0) is constant, as it is computed on the template image (which is fixed) at the point µ = 0 (cf. Equation 3.22). Notice that the crucial point of the derivation of the algorithm lies in the change of variables in Equation 3.20. Solving for the search direction only consists of computing the IC residuals and the least-squares approximation (Equation 3.24). The Dissimilarity Linearization stage from Algorithm 1 is no longer required, which results in a substantial boost in the performance of the algorithm.
Algorithm 5 Outline of the Inverse Compositional algorithm.
Off-line: Compute J(0) = ∇µr(0) using Equation 3.22.
On-line: Let µi = µ0 be the initial guess.
1: while no convergence do
2: Compute the residual function r(µi) from Equation 3.20.
3: Compute the search direction: δµi = −(J(0)⊤J(0))⁻¹ J(0)⊤ r(0).
4: Update the optimization parameters: µi+1 = µi ◦ δµi⁻¹.
5: end while
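The structure of Algorithm 5 can be sketched on a toy problem: a 1-D translation warp f(x; µ) = x + µ, for which composition is addition and inversion is negation, so the update µi+1 = µi ◦ δµi⁻¹ becomes µ ← µ − δµ. The synthetic brightness profile and all names below are illustrative assumptions, not the thesis implementation; note the Jacobian is computed once, off-line.

```python
import numpy as np

def smooth(x):
    # Synthetic smooth brightness profile standing in for an interpolated image.
    return np.exp(-0.5 * (x / 3.0) ** 2)

x = np.linspace(-8.0, 8.0, 161)
true_mu = 1.3
T = smooth(x)                        # template
I = lambda u: smooth(u - true_mu)    # "current frame": template shifted by true_mu

# Off-line: constant IC Jacobian J(0) = T'(x), and its 1-parameter pseudo-inverse.
J = np.gradient(T, x)
Jpinv = J / (J @ J)

# On-line IC loop.
mu = 0.0
for _ in range(50):
    r = T - I(x + mu)                # IC residuals r(0) at the current mu
    delta = -(Jpinv @ r)             # least-squares search direction
    mu = mu - delta                  # compositional update mu ∘ delta^{-1}
    if abs(delta) < 1e-10:
        break
```

After a few iterations mu approaches the true shift; no re-linearization is ever performed inside the loop, which is exactly the source of IC's efficiency.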
Figure 3.7: Inverse compositional image registration. We initialize the current parameter estimation at frame It+1 (µ ≡ µ0) using the local minimizer at the previous frame It (µ0 ≡ µ∗t). At this point we compute the Jacobian J(0) using Equation 3.22. We compute the dissimilarity residuals between the image and the template using µ (Equation 3.15). Using J(0) we compute the descent direction δµ (Equation 3.24). We update the parameters using inverse function composition, i.e. µ1 = µ0 ◦ δµ⁻¹. We check whether µ1 is a local minimizer by using the brightness dissimilarity: if D (Equation 3.15) is small enough, then µ1 is the local minimizer (µ∗ ≡ µ1); otherwise, we repeat the process using µ1 as the current parameter estimation (µ ≡ µ1).
Relevance of IC
The IC algorithm is known to be the most efficient optimization technique for direct
image registration [Baker and Matthews, 2004]. The algorithm was initially proposed for template tracking, although it was later extended to aams [Matthews and Baker, 2004], to register 3D Morphable Models [Romdhani and Vetter, 2003; Xu and Roy-Chowdhury, 2008], to account for photometric changes [Bartoli, 2008], and to allow for appearance variation [Gonzalez-Mora et al., 2009].
Some efficient algorithms using a constant residual Jacobian with additive increments have been proposed in the literature, but none of them shows reliable performance. In [Cootes et al., 2001] an iterative regression-based gradient scheme is proposed to align AAMs to frontal images of faces. The regression matrix (similar to our Jacobian matrix) is numerically computed off-line and remains constant during the Gauss-Newton optimisation. The method shows good performance because the solution does not depart far from the initial guess. The method is revisited in [Donner et al., 2006] using Canonical Correlation Analysis instead of numerical differentiation to achieve a better convergence rate and range. In [La Cascia et al., 2000] the authors propose a Gauss-Newton scheme with a constant Jacobian matrix for 6-dof 3D tracking of heads. The method needs regularisation constraints to improve the convergence of the optimisation.
Recently, [Brooks and Arbel, 2010] augmented the scope of the IC framework with the Generalized Inverse Compositional (GIC) image registration: they propose an additive update to the parameters that is equivalent to the compositional update from IC; therefore, they can adapt IC to optimization methods other than GN, such as Broyden-Fletcher-Goldfarb-Shanno (bfgs) [Press et al., 1992].
3.5 Other Methods
Iterative gradient-based optimization algorithms (see Figure 3.4) can improve their
efficiency in two different ways: (1) by speeding up the linearization of the dissim-
ilarity function, and (2) by reducing the number of iterations of the process. The
algorithms that we have presented—i.e. HB and IC—belong to the first type. The
second type of methods achieve efficiency by using a more involved linearization
that converges faster to the minimum. [Averbuch and Keller, 2002] approximates
the error function in both the template and the current image and average the
least-squares solution to both. They show it converges with less iterations than
LK although the time per iteration is higher. Malis et. al [Benhimane and Malis,
2007] propose a similar method called Efficient Second-Order Minimization (esm)
which differs from the latter in using an efficient linearization on the template by
means of Lie algebra properties. Recently, both methods have been revisited and
reformulated in a common Bi-directional Framework in [Megret et al., 2008]. [Keller
and Averbuch, 2008] derives a high-order approximation to the error function that
leads to a faster algorithm with a wider convergence basin. Unfortunately—with
the exception of esm—none of these algorithm are appropriate for real-time image
37
60. registration.
3.6 Summary
We have introduced the basic concepts of direct image registration. We pose the registration problem as the gradient-based optimization of a dissimilarity function based on brightness differences. We classify direct image registration algorithms as either additive or compositional: in the former group we highlight the LK and HB algorithms, whereas the FC and IC algorithms belong to the latter.
Chapter 4
Equivalence of Gradients
In this chapter we introduce the concept of Equivalence of Gradients, that is, the process of replacing the gradient of a brightness function with an equivalent alternative. In Chapter 3 we showed that some efficient algorithms for direct image registration use a gradient replacement technique as the basis for their speed improvement: (1) the HB algorithm transforms the template derivatives using the target warp to yield the image derivatives; and (2) the IC algorithm replaces the image derivatives by the template derivatives without any modification, but changes the parameter update rule so that the GN-like optimization converges. We introduce a new constraint, the Gradient Equivalence Equation, and we show that this constraint is a necessary requirement for the high computational efficiency of both the HB and IC algorithms.
We organize the chapter as follows: Section 4.1 introduces the basic concepts of image gradients in R², and their extension to higher-dimensional spaces such as P² and R³; Section 4.2 introduces the Gradient Equivalence Equation, which shall subsequently be used to impose requirements on the registration algorithms.
4.1 Image Gradients
We introduce the concept of gradient of a scalar function below. We consider images
as functions in two dimensions that assign a brightness value to an image pixel
position.
The Concept of Gradient The gradient of a scalar function f : Rⁿ → R at a point x ∈ Rⁿ is a vector ∇f(x) ∈ Rⁿ that points in the direction of greatest rate of increase of f(x). The length of the gradient vector, |∇f(x)|, is the greatest rate of change of the function at that point.
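The definition can be checked numerically with central differences. A small sketch on an illustrative quadratic, where the analytic gradient of f(x, y) = x² + 2y² is (2x, 4y):

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 + 2 * y ** 2

def grad(f, p, h=1e-5):
    # Central-difference approximation of the gradient of f at p.
    p = np.asarray(p, float)
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

g = grad(f, [1.0, 2.0])   # analytic value: (2.0, 8.0)
```

The vector g points in the direction of steepest increase of f at (1, 2), as the definition states.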
Image Gradients Grayscale images are discrete scalar functions I : R² → R ranging from 0 (black) to 255 (white); see Figure 4.1. We restrict our attention to grayscale images, but we may deal with colour images (e.g. rgb images) by simply treating them as one grayscale image per colour plane. Grayscale images are discrete functions: we represent an image as a matrix whose elements I(i, j) are the brightness function values. We obtain a continuous approximation to the discrete function by interpolation (see Figure 4.1).
We introduce the image gradients in the most common domains in Computer Vision: R², P², and R³. Image gradients are naturally defined in R², since images are functions defined on that domain. In some Computer Vision applications the domain of x, D, is not constrained to R², but is P² [Buenaposada and Baumela, 2002; Cobzas et al., 2009] or R³ [Sepp, 2006; Xu and Roy-Chowdhury, 2008]. In the following, the target coordinates are expressed in a domain D ∈ {R³, P²}, so we need a projection function to map the target coordinates onto the image. We generically define the projection mapping as p : D → R².
The corresponding projectors are the homogeneous-to-Cartesian mapping, p : P² → R², and the perspective projection, p : R³ → R². Image gradients in domains other than R² are computed by using the chain rule with the projector p : Rⁿ → R²:
∇x̂(I ◦ p)(x) = ∇x̂I(p(x))
             = ∂I(X̂)/∂X̂ |X̂=p(x) · ∂p(Ŷ)/∂Ŷ |Ŷ=x
             = ∇X̂I(p(x)) ∇x̂p(x),   x ∈ D ⊂ Rⁿ. (4.1)
Equation 4.1 represents the image gradient in domain D as the image gradient in R² lifted onto the higher-dimensional space D by means of the Jacobian matrix ∇x̂p(x).
Notation We use the operator [ ] to denote the composite function I ◦ p, that is, I(p(x)) = I[x].
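Equation 4.1 can be verified numerically for the homogeneous-to-Cartesian projector p(x) = (x1/x3, x2/x3). The smooth stand-in for an interpolated image and the helper names below are assumptions of this sketch:

```python
import numpy as np

def I(u):
    # Smooth stand-in for an interpolated image brightness function on R^2.
    return np.sin(u[0]) * np.cos(0.5 * u[1])

def p(x):
    # Homogeneous-to-Cartesian projector P^2 -> R^2.
    return np.array([x[0] / x[2], x[1] / x[2]])

def dp(x):
    # Analytic 2 x 3 Jacobian of the projector.
    x1, x2, x3 = x
    return np.array([[1 / x3, 0, -x1 / x3 ** 2],
                     [0, 1 / x3, -x2 / x3 ** 2]])

def num_grad(f, x, h=1e-6):
    # Central-difference gradient of a scalar function.
    x = np.asarray(x, float)
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                     for e in np.eye(len(x))])

x = np.array([0.4, -0.7, 2.0])          # a point in the higher-dimensional domain
grad_I = num_grad(I, p(x))              # image gradient in R^2, at p(x)
lifted = grad_I @ dp(x)                 # chain rule: gradient lifted to the domain
direct = num_grad(lambda y: I(p(y)), x) # direct differentiation of I ∘ p
```

The lifted gradient matches direct differentiation of the composite function, as Equation 4.1 asserts.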
4.1.1 Image Gradients in R²
If the target and its kinematics are expressed in R², there is no need to use a projector, as both the target and the image share a common reference frame. The gradient of a grayscale image at point x = (i, j)⊤ is the vector

∇x̂I(x) = (∇iI(x), ∇jI(x)) = (∂I(x)/∂i, ∂I(x)/∂j), (4.2)
that flows from the darker areas of the image to the brighter ones (see Figure 4.1).
Moreover, the direction of the gradient vector at point x ∈ R² is orthogonal to the level set of the brightness function at that point (see Figure 4.1).
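In practice the discrete image gradient is computed with finite differences. A minimal sketch on a synthetic brightness ramp (the ramp and its slopes are illustrative), using NumPy's central-difference routine:

```python
import numpy as np

# Synthetic 16 x 16 image whose brightness increases linearly with i and j.
i, j = np.meshgrid(np.arange(16), np.arange(16), indexing="ij")
img = (4.0 * i + 2.0 * j).astype(float)

# Partial derivatives along rows (i) and columns (j) by central differences;
# the gradient points from the darker corner towards the brighter one.
di, dj = np.gradient(img)
```

On this linear ramp the recovered slopes are exact, and the gradient vectors (4, 2) all point from dark to bright, as described above.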
Figure 4.1: Depiction of Image Gradients. (Top-left) An image is a rectangular array where each element is a brightness value. (Top-right) Continuous representation of the image brightness values; we compute the values from the discrete array by interpolation. (Bottom-left) Image gradients are vectors from each image array element in the direction of maximum increase of brightness (compare to the top-right image). (Bottom-right) Gradient vectors are orthogonal to the contour curves of the brightness function. Legend: blue: gradient vectors; other colours: contour curves.