Image Analysis & Retrieval
CS/EE 5590 Special Topics (Class Ids: 44873,44874)
Fall 2016, M/W 4-5:15pm @ Bloch0012
Lec 08
Feature Aggregation II:
Fisher Vector, Super Vector and AKULA
Zhu Li
Dept of CSEE, UMKC
Office: FH560E, Email: lizhu@umkc.edu, Ph: x 2346.
http://l.web.umkc.edu/lizhu
p.1Z. Li, Image Analysis&Retrv.2016
Outline
• ReCap of Lecture 07
• Image Retrieval System
• BoW
• VLAD
• Dense SIFT
• Fisher Vector Aggregation
• AKULA
• Summary
Precision, Recall, F-measure
• Precision = TP/(TP + FP)
• Recall (TPR) = TP/(TP + FN)
• FPR = FP/(FP + TN)
• F-measure = 2*(precision*recall)/(precision + recall)
Precision: the probability that a retrieved document is relevant.
Recall: the probability that a relevant document is retrieved in a search.
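The four measures can be checked numerically from the confusion counts. The sketch below uses the standard definitions (precision = TP/(TP+FP), recall = TPR = TP/(TP+FN), FPR = FP/(FP+TN)); the function name and the example counts are made up for illustration.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Return (precision, recall, fpr, f_measure) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # recall is the TPR
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fpr, f_measure


# Hypothetical query: 8 relevant images retrieved, 2 irrelevant retrieved,
# 4 relevant images missed, 86 irrelevant images correctly left out.
precision, recall, fpr, f_measure = retrieval_metrics(tp=8, fp=2, fn=4, tn=86)
print(precision, recall, fpr, f_measure)
```

Note that precision and FPR use different denominators: precision is normalized by everything retrieved, FPR by everything irrelevant.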
Why Aggregation?
• Curse of Dimensionality
• Decision Boundary / Indexing
Bag-of-Words: Histogram Coding
Codebook:
• Feature space: $R^d$; run k-means to get k centroids $\{\mu_1, \mu_2, \ldots, \mu_k\}$
BoW Hard Encoding:
• For n feature points $\{x_1, x_2, \ldots, x_n\}$, the assignment matrix is k x n, and each column has exactly one non-zero entry
• Aggregated dimension: k
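A minimal sketch of BoW hard encoding, assuming the codebook centroids have already been learned (for example by k-means). The function names and the toy 2-D data are illustrative only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def bow_hard(features, centroids):
    """k-bin histogram: each feature votes for its nearest centroid only."""
    hist = [0] * len(centroids)
    for x in features:
        nearest = min(range(len(centroids)),
                      key=lambda k: sq_dist(x, centroids[k]))
        hist[nearest] += 1
    return hist


# Toy codebook with k = 3 centroids and n = 4 feature points in R^2.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (0.0, 1.9)]
hist = bow_hard(features, centroids)
print(hist)
```

The histogram is the row sum of the k x n assignment matrix, so its entries always add up to n regardless of the feature dimension d.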
Kernel Code Book Soft Encoding
• Kernel affinity: $K(x_j, \mu_k) = e^{-\gamma \|x_j - \mu_k\|^2}$, where $\gamma > 0$ is the kernel scale
• Assignment matrix: $A_{j,k} = K(x_j, \mu_k) / \sum_{k'} K(x_j, \mu_{k'})$
• Encoding, k-dimensional: $X(k) = \frac{1}{n} \sum_j A_{j,k}$
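The soft encoding can be sketched as follows. Here `gamma` stands for the scale constant in the exponent of the kernel affinity; its value and the toy data are assumptions for illustration.

```python
import math


def soft_encode(features, centroids, gamma=1.0):
    """Kernel-codebook encoding: average of row-normalized affinities."""
    k = len(centroids)
    code = [0.0] * k
    for x in features:
        # Affinity of x to every centroid (Gaussian kernel).
        aff = [math.exp(-gamma * sum((xi - mi) ** 2 for xi, mi in zip(x, mu)))
               for mu in centroids]
        total = sum(aff)  # assumed non-zero, i.e. x is not far from all centroids
        for idx in range(k):
            code[idx] += aff[idx] / total
    n = len(features)
    return [c / n for c in code]


centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (0.0, 1.9)]
code = soft_encode(features, centroids, gamma=0.5)
print(code)
```

Because each row of the assignment matrix sums to one, the encoded vector also sums to one; a large `gamma` approaches the hard BoW histogram divided by n.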
VLAD - Vector of Locally Aggregated Descriptors
• Aggregate feature differences from the codebook
• Hard assignment: find the nearest neighbor (NN) of each feature $x_j$ among the centroids $\{\mu_k\}$
• Compute aggregated differences: $v_k = \sum_{j:\, NN(x_j) = \mu_k} (x_j - \mu_k)$
• L2 normalize: $v_k \leftarrow v_k / \|v_k\|_2$
• Final feature: k x d
[Figure: a feature x among five centroids $\mu_1, \ldots, \mu_5$ with residuals $v_1, \ldots, v_5$: ① assign descriptors; ② compute $x - \mu_i$; ③ $v_i$ = sum of $x - \mu_i$ for cell i]
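The three VLAD steps can be sketched in a few lines. This is an illustrative version that applies the L2 normalization per centroid, as the formula on the slide suggests; other normalization variants exist, and the function name and toy data are assumptions.

```python
import math


def vlad(features, centroids):
    """Return the k*d VLAD vector: per-centroid sums of residuals, L2-normalized."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in features:
        # Step 1: hard-assign x to its nearest centroid.
        nn = min(range(k),
                 key=lambda c: sum((xi - mi) ** 2
                                   for xi, mi in zip(x, centroids[c])))
        # Steps 2 and 3: accumulate the residual x - mu_nn in that cell.
        for t in range(d):
            v[nn][t] += x[t] - centroids[nn][t]
    for row in v:
        norm = math.sqrt(sum(c * c for c in row))
        if norm > 0:  # cells with no assigned features stay all-zero
            for t in range(d):
                row[t] /= norm
    return [c for row in v for c in row]


centroids = [(2.0, 0.0), (0.0, 4.0)]
features = [(1.0, 0.0), (2.5, 1.0), (0.0, 5.0)]
desc = vlad(features, centroids)
print(desc)
```

With k = 2 centroids in d = 2 dimensions the descriptor has k x d = 4 entries, matching the final feature size stated above.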
VLAD on SIFT
• Example of aggregating SIFT with VLAD
• K=16 codebook entries
• Each cell is a SIFT descriptor, visualized with centroids in blue and VLAD differences in red
• Top row: left image; bottom row: right image; red: codebook, blue: encoded VLAD
One more trick
• Recall that SIFT is a powerful descriptor
• VL_FEAT: vl_dsift
• A dense description of the image, obtained by computing the SIFT descriptor at a predetermined grid (no scale-space extrema detection)
• Supplements HoG as an alternative texture descriptor
VL_FEAT: vl_dsift
• Compute dense SIFT as a texture descriptor for the image
    [f, dsift] = vl_dsift(single(rgb2gray(im)), 'step', 2);
• There's also a FAST option
    [f, dsift] = vl_dsift(single(rgb2gray(im)), 'fast', 'step', 2);
• A huge amount of SIFT data will be generated
Fisher Vector
• Fisher Vector and variations:
• Winning in image classification:
• Winning in the MPEG object re-identification:
  o SCFV (Scalable Coded Fisher Vec) in CDVS
Codebook: Gaussian Mixture Model (GMM)
• GMM is a generative model to express data
• Assume the data is generated from a GMM with parameters $\{w_k, \mu_k, \sigma_k\}$:
  $x \sim \sum_{k=1}^{K} w_k N(\mu_k, \sigma_k)$
  $N(\mu_k, \sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)' \Sigma_k^{-1} (x-\mu_k)}$
A bit of Theory: Fisher Kernel
Encode the deviation from the generative model
• Observed feature set $\{x_1, x_2, \ldots, x_n\}$ in $R^d$, e.g., d=128 for SIFT.
• How do these observations deviate from the given GMM model with parameter set $\lambda = \{w_k, \mu_k, \sigma_k\}$?
  o i.e., how should a parameter, e.g., the mean, move to best fit the observations?
[Figure: observation X1 among centroids $\mu_1, \mu_2, \mu_3, \mu_4$]
A bit of Theory: Fisher Kernel
Score function w.r.t. the likelihood function $u_\lambda(X)$
• $G_\lambda^X = \nabla_\lambda \log u_\lambda(X)$: derivative of the log likelihood
• The dimension of the score function is m, where m is the number of generative model parameters; for a GMM the parameters form 3 groups (weights, means, variances)
• Given the observed data X, the score function indicates how the likelihood function parameters (e.g., the mean) should move to better fit the data.
Distance/deviation of two observations X, Y w.r.t. the generative model
• Fisher Information Matrix (roughly the covariance in the Mahalanobis distance):
  $F_\lambda = E_X[G_\lambda^X {G_\lambda^X}']$
• Fisher Kernel distance, normalized by the Fisher Information Matrix:
  $K_{FK}(X, Y) = {G_\lambda^X}' F_\lambda^{-1} G_\lambda^Y$
Fisher Vector
• $K_{FK}(X, Y)$ is a measure of similarity w.r.t. the generative model
• Similar to the Mahalanobis distance case, we can decompose this kernel as
  $K_{FK}(X, Y) = {G_\lambda^X}' F_\lambda^{-1} G_\lambda^Y = {G_\lambda^X}' L_\lambda' L_\lambda G_\lambda^Y$
• That gives us a kernel feature mapping of X to the Fisher Vector
• For observed image features $\{x_t\}$, it can be computed as
  [Equation not recovered]
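The decomposition can be written out step by step. This is the standard construction from the Fisher kernel literature, given as a sketch: the symbol $\mathcal{G}$ for the normalized vector, the sample count $T$, and the independence assumption on the local features are not taken from the slide.

```latex
% Factor the inverse Fisher information matrix (e.g. by Cholesky decomposition):
F_\lambda^{-1} = L_\lambda' L_\lambda
\quad\Rightarrow\quad
K_{FK}(X, Y) = (L_\lambda G_\lambda^X)' (L_\lambda G_\lambda^Y)
             = {\mathcal{G}_\lambda^X}' \, \mathcal{G}_\lambda^Y,
\qquad
\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X .

% Assuming the T local features x_t are drawn independently from u_\lambda,
% the log likelihood is a sum, so the score is a sum of per-feature gradients:
G_\lambda^X = \nabla_\lambda \log u_\lambda(X)
            = \sum_{t=1}^{T} \nabla_\lambda \log u_\lambda(x_t) .
```

The kernel therefore becomes a plain dot product between Fisher Vectors, which is why linear classifiers and Euclidean indexing can be applied to them directly. In practice a factor $1/T$ is often added so that the vector does not grow with the number of features.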
GMM Fisher Vector
Encode the deviation from the generative model
• Observed feature set $\{x_1, x_2, \ldots, x_n\}$ in $R^d$, e.g., d=128 (!) for SIFT.
• How do these observations deviate from the given GMM model with parameter set $\theta = \{a_k, \mu_k, \sigma_k\}$?
• GMM Log Likelihood Gradient
• Let $w_k = \frac{e^{a_k}}{\sum_j e^{a_j}}$; then we have the gradients w.r.t. the weight, mean, and variance parameters:
  [Equations not recovered: gradients w.r.t. weight, mean, variance]
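For reference, the log likelihood gradients of a diagonal-covariance GMM under the soft-max weight parametrization take the following well-known form. This is a sketch of the standard result, not a transcription of the slide: the posterior symbol $\gamma_t(k)$, the component density $u_k$, and the sample index $t = 1..T$ are notation introduced here.

```latex
% Posterior (responsibility) of component k for feature x_t:
\gamma_t(k) = \frac{w_k \, u_k(x_t)}{\sum_{j=1}^{K} w_j \, u_j(x_t)},
\qquad u_k(x) = N(x;\, \mu_k, \sigma_k^2)

% Gradients (element-wise in the d dimensions for mean and variance terms):
\frac{\partial \log u_\theta(X)}{\partial a_k}
   = \sum_{t=1}^{T} \bigl( \gamma_t(k) - w_k \bigr)                      % weight

\frac{\partial \log u_\theta(X)}{\partial \mu_k}
   = \sum_{t=1}^{T} \gamma_t(k) \, \frac{x_t - \mu_k}{\sigma_k^{2}}      % mean

\frac{\partial \log u_\theta(X)}{\partial \sigma_k}
   = \sum_{t=1}^{T} \gamma_t(k)
     \left[ \frac{(x_t - \mu_k)^2}{\sigma_k^{3}} - \frac{1}{\sigma_k} \right]  % variance
```

Each gradient is a posterior-weighted sum, so a feature only contributes to the components that are likely to have generated it.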
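GMM Fisher Vector VL_FEAT implementation
• GMM codebook
• For a K-component GMM, we only allow 3K parameters, $\{\pi_k, \mu_k, \sigma_k \mid k = 1..K\}$, i.e., iid Gaussian components with a diagonal covariance:
  $\Sigma_k = \mathrm{diag}(\sigma_k, \sigma_k, \ldots, \sigma_k)$
• Posterior probability of feature point $x_i$ belonging to GMM component k:
  $q_{ik} = \frac{\pi_k N(x_i;\, \mu_k, \Sigma_k)}{\sum_{t=1}^{K} \pi_t N(x_i;\, \mu_t, \Sigma_t)}$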
GMM Fisher Vector VL_FEAT implementation
• FV encoding
• Gradient on the mean, for GMM component k, j=1..D
  [Equation not recovered]
• In the end, we have a 2K x D aggregation of the deviations w.r.t. the means and variances:
  $FV = [u_1, u_2, \ldots, u_K, v_1, v_2, \ldots, v_K]$
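The encoding can be sketched in pure Python. The version below assumes a diagonal-covariance GMM and uses the normalized mean and variance gradients in the form commonly given for the Fisher Vector, $u_{jk} = \frac{1}{N\sqrt{\pi_k}}\sum_i q_{ik}\frac{x_{ji}-\mu_{jk}}{\sigma_{jk}}$ and $v_{jk} = \frac{1}{N\sqrt{2\pi_k}}\sum_i q_{ik}\bigl[(\frac{x_{ji}-\mu_{jk}}{\sigma_{jk}})^2 - 1\bigr]$. It is an illustration, not the VL_FEAT source; the function names are made up, and optional post-processing (power or L2 normalization) is left out.

```python
import math


def posteriors(x, priors, means, variances):
    """Posterior q_k of each GMM component for one feature x (log domain)."""
    logp = []
    for w, mu, var in zip(priors, means, variances):
        ll = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                 for xi, m, v in zip(x, mu, var))
        logp.append(math.log(w) + ll)
    top = max(logp)                      # subtract the max to avoid underflow
    e = [math.exp(l - top) for l in logp]
    total = sum(e)
    return [a / total for a in e]


def fisher_vector(features, priors, means, variances):
    """Return [u_1..u_K, v_1..v_K], a vector of length 2*K*D."""
    K, D, N = len(priors), len(means[0]), len(features)
    u = [[0.0] * D for _ in range(K)]
    v = [[0.0] * D for _ in range(K)]
    for x in features:
        q = posteriors(x, priors, means, variances)
        for k in range(K):
            for j in range(D):
                z = (x[j] - means[k][j]) / math.sqrt(variances[k][j])
                u[k][j] += q[k] * z                # deviation w.r.t. the mean
                v[k][j] += q[k] * (z * z - 1.0)    # deviation w.r.t. the variance
    for k in range(K):
        for j in range(D):
            u[k][j] /= N * math.sqrt(priors[k])
            v[k][j] /= N * math.sqrt(2.0 * priors[k])
    return [c for row in u for c in row] + [c for row in v for c in row]


# Toy GMM with K = 2 components in D = 2 dimensions (per-dimension variances).
priors = [0.5, 0.5]
means = [(0.0, 0.0), (3.0, 3.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]
features = [(0.2, -0.1), (2.9, 3.3), (0.1, 0.4)]
fv = fisher_vector(features, priors, means, variances)
print(len(fv))
```

When the features match the model exactly in mean and variance, both blocks are zero; the vector only records how the observations pull the model parameters away from their current values.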
VL_FEAT GMM/FV API
• Compute GMM model with VL_FEAT
• Prepare data:
    numPoints = 1000 ; dimension = 2 ;
    data = rand(dimension, numPoints) ;
• Call vl_gmm:
    numClusters = 30 ;
    [means, covariances, priors] = vl_gmm(data, numClusters) ;
• Visualize:
    figure ;
    hold on ;
    plot(data(1,:), data(2,:), 'r.') ;
    for i = 1:numClusters
        % draw each component as an ellipse from its diagonal covariance
        vl_plotframe([means(:,i)' covariances(1,i) 0 covariances(2,i)]) ;
    end
VL_FEAT API
• FV encoding
    encoding = vl_fisher(datatoBeEncoded, means, covariances, priors);
• Bonus points:
• Encode HoG features with Fisher Vector?
• Randomly collect 2~3 images from each class
• Stack all HoG features together into an n x 36 data matrix (transpose it to 36 x n before calling vl_gmm, which expects one feature per column)
• Compute its GMM
• Use this GMM to encode all image HoG features (instead of averaging them)
Super Vector Aggregation – Speaker ID
• Fisher Vector: aggregates features against a GMM
• Super Vector: aggregates GMM against GMM
• Ref:
  o William M. Campbell, Douglas E. Sturim, Douglas A. Reynolds: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5): 308-311 (2006)
[Figure: speaker identification example, "Yes, We Can!"]
Super Vector from MFCC
• Motivated by Speaker ID work
• Speech is a continuous evolution of the vocal tract
• Need to extract a sequence of spectra or a sequence of spectral coefficients
• Use a sliding window: 25 ms window, 10 ms shift
[Figure: MFCC extraction: Log|X(ω)|, DCT, MFCC]
GMM Model from MFCC
• GMM on MFCC features
• The acoustic vectors (MFCC) of speaker s are modeled by a probability density function parameterized by
  $\Lambda^{(s)} = \{\lambda_j^{(s)}, \mu_j^{(s)}, \Sigma_j^{(s)}\}_{j=1}^{M}$
• Gaussian mixture model (GMM) for speaker s:
  $p(x \mid \Lambda^{(s)}) = \sum_{j=1}^{M} \lambda_j^{(s)} \, p(x \mid \mu_j^{(s)}, \Sigma_j^{(s)})$
Universal Background Model
• UBM GMM Model:
• The acoustic vectors of a general population are modeled by another GMM called the universal background model (UBM):
  $p(x \mid \Lambda^{(ubm)}) = \sum_{j=1}^{M} \lambda_j^{(ubm)} \, p(x \mid \mu_j^{(ubm)}, \Sigma_j^{(ubm)})$
• Parameters of the UBM:
  $\Lambda^{(ubm)} = \{\lambda_j^{(ubm)}, \mu_j^{(ubm)}, \Sigma_j^{(ubm)}\}_{j=1}^{M}$
MAP Adaptation
• Given the UBM GMM, how does the new observation deviate from it?
• The adapted mean is given by:
  [Equation not recovered]
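One common form of mean-only MAP adaptation, from the speaker verification literature, is $\hat\mu_j = \alpha_j E_j(x) + (1-\alpha_j)\mu_j^{(ubm)}$ with $\alpha_j = n_j / (n_j + r)$, where $n_j$ is the soft count of frames assigned to component j and $E_j(x)$ their posterior-weighted mean. The sketch below assumes that form with a diagonal-covariance UBM; the relevance factor `r = 16` and all names are illustrative choices, and the slide's own equation may differ in notation.

```python
import math


def responsibilities(x, weights, means, variances):
    """Posterior Pr(j | x) under a diagonal-covariance GMM (log domain)."""
    logp = []
    for w, mu, var in zip(weights, means, variances):
        ll = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                 for xi, m, v in zip(x, mu, var))
        logp.append(math.log(w) + ll)
    top = max(logp)
    e = [math.exp(l - top) for l in logp]
    total = sum(e)
    return [a / total for a in e]


def map_adapt_means(frames, weights, means, variances, r=16.0):
    """Adapt only the UBM means to the given frames; r is the relevance factor."""
    M, D = len(weights), len(means[0])
    n = [0.0] * M                          # soft counts n_j
    s = [[0.0] * D for _ in range(M)]      # posterior-weighted sums of frames
    for x in frames:
        post = responsibilities(x, weights, means, variances)
        for j in range(M):
            n[j] += post[j]
            for t in range(D):
                s[j][t] += post[j] * x[t]
    # alpha_j * E_j(x) + (1 - alpha_j) * mu_j with alpha_j = n_j / (n_j + r)
    # simplifies to (s_j + r * mu_j) / (n_j + r), which is also safe for n_j = 0.
    return [[(s[j][t] + r * means[j][t]) / (n[j] + r) for t in range(D)]
            for j in range(M)]


# Toy UBM with M = 2 components in 2 dimensions and three observed frames.
ubm_w = [0.5, 0.5]
ubm_mu = [(0.0, 0.0), (5.0, 5.0)]
ubm_var = [(1.0, 1.0), (1.0, 1.0)]
frames = [(0.5, 0.2), (0.8, -0.1), (5.5, 4.6)]
adapted = map_adapt_means(frames, ubm_w, ubm_mu, ubm_var)
print(adapted)
```

Components that see few frames stay close to the UBM mean, while well-observed components move toward the data; stacking the adapted means gives the supervector.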
Supervector Distance
• Assume we have the UBM GMM model $\lambda_{UBM} = \{P_k, \mu_k, \Sigma_k\}$, and that the adapted models keep its priors and covariances
• Then for two utterance samples a and b, with GMM models
  o $\lambda_a = \{P_k, \mu_k^a, \Sigma_k\}$
  o $\lambda_b = \{P_k, \mu_k^b, \Sigma_k\}$
• The SV distance is
  $K(\lambda_a, \lambda_b) = \sum_k \left(\sqrt{P_k}\, \Sigma_k^{-1/2} \mu_k^a\right)^T \left(\sqrt{P_k}\, \Sigma_k^{-1/2} \mu_k^b\right)$
• That is, the means of the two models are normalized by the Mahalanobis distance metric induced by the UBM covariance
• This is also a linear kernel function scaled by the UBM covariances
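The supervector kernel can be sketched as a scaled stacking of the adapted means followed by a dot product. The version below assumes diagonal covariances (so $\Sigma_k^{-1/2}$ is an element-wise division by the standard deviation); the function names and numbers are made up for illustration.

```python
import math


def supervector(priors, variances, means):
    """Stack sqrt(P_k) * Sigma_k^(-1/2) * mu_k over all components."""
    return [math.sqrt(p) * m / math.sqrt(v)
            for p, var, mu in zip(priors, variances, means)
            for m, v in zip(mu, var)]


def sv_kernel(priors, variances, means_a, means_b):
    """Linear kernel between the two scaled supervectors."""
    sa = supervector(priors, variances, means_a)
    sb = supervector(priors, variances, means_b)
    return sum(x * y for x, y in zip(sa, sb))


# Shared UBM priors and variances; only the adapted means differ per utterance.
priors = [0.25, 0.75]
variances = [(4.0, 1.0), (1.0, 4.0)]
means_a = [(2.0, 1.0), (0.0, 2.0)]
means_b = [(2.0, 3.0), (1.0, 2.0)]
kernel = sv_kernel(priors, variances, means_a, means_b)
print(kernel)
```

Because the scaling is folded into the supervector itself, any linear method (for example a linear SVM) applied to the scaled vectors uses this kernel implicitly.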
Supervector Performance in NIST Speaker ID
• System 5: Gaussian SV
• DCF (Detection Cost Function)
m31491
AKULA – Adaptive KLUster Aggregation
2013/10/25
Abhishek Nagar, Zhu Li, Gaurav Srivastava and Kyungmo Park
Outline
Motivation
Adaptive Aggregation
Results with TM7
Summary
Motivation
Better Aggregation
• Fisher Vector and VLAD type aggregations depend on a global model
• AKULA removes this dependence and directly codes the cluster centroids and SIFT counts
• SCFV and RVD both have situations where clusters are turned off due to no assignment; this can be avoided in AKULA
[Figure: pipeline: SIFT detection & selection, K-means, AKULA description]
Motivation
Better Subspace Choice
• Both SCFV and RVD do fixed normalization and PCA projection based on heuristics.
• What is the best possible subspace to do the aggregation?
• Use a boosting scheme to keep adding subspaces and aggregations in an iterative fashion, and tune TPR-FPR to the desired operating points on FPR.
CE2: AKULA – Adaptive KLUster Aggregation
AKULA Descriptor: cluster centroids + SIFT count
  $A_1 = \{yc^1_1, yc^1_2, \ldots, yc^1_k;\ pc^1_1, pc^1_2, \ldots, pc^1_k\}$
  $A_2 = \{yc^2_1, yc^2_2, \ldots, yc^2_k;\ pc^2_1, pc^2_2, \ldots, pc^2_k\}$
Distance metric:
• Min centroid distance, weighted by SIFT count
  $d(A_1, A_2) = \frac{1}{k}\sum_{j=1}^{k} d^1_{min}(j)\, w^1_{min}(j) + \frac{1}{k}\sum_{i=1}^{k} d^2_{min}(i)\, w^2_{min}(i)$
  $d^1_{min}(j) = \min_i d_{j,i}$,  $d^2_{min}(i) = \min_j d_{j,i}$
  $w^1_{min}(j) = w_{j,i^*}$,  $i^* = \arg\min_i d_{j,i}$
  $w^2_{min}(i) = w_{j^*,i}$,  $j^* = \arg\min_j d_{j,i}$
  where $d_{j,i}$ is the distance between centroid j of $A_1$ and centroid i of $A_2$, and $w_{j,i}$ is the corresponding SIFT-count weight
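The two-way minimum-distance metric can be sketched as below. Two things are assumptions: centroid distances are taken to be Euclidean, and the weight function `count_weight` is purely hypothetical, because the slides say the distance is weighted by SIFT count without defining the weight.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def count_weight(p, q):
    """HYPOTHETICAL weight from two positive SIFT counts (not from the slides)."""
    return min(p, q) / max(p, q)


def akula_distance(cent1, cnt1, cent2, cnt2, weight=count_weight):
    """Two-way min centroid distance, each term weighted by the matched counts."""
    k1, k2 = len(cent1), len(cent2)
    d = [[euclid(cent1[j], cent2[i]) for i in range(k2)] for j in range(k1)]
    # Direction 1: every centroid j of A1 is matched to its closest centroid in A2.
    term1 = 0.0
    for j in range(k1):
        i_star = min(range(k2), key=lambda i: d[j][i])
        term1 += d[j][i_star] * weight(cnt1[j], cnt2[i_star])
    # Direction 2: every centroid i of A2 is matched to its closest centroid in A1.
    term2 = 0.0
    for i in range(k2):
        j_star = min(range(k1), key=lambda j: d[j][i])
        term2 += d[j_star][i] * weight(cnt1[j_star], cnt2[i])
    return term1 / k1 + term2 / k2


# Toy descriptors with k = 2 clusters each: (centroids, SIFT counts).
cent1, cnt1 = [(0.0, 0.0), (4.0, 0.0)], [10, 5]
cent2, cnt2 = [(0.0, 1.0), (4.0, 3.0)], [10, 10]
dist = akula_distance(cent1, cnt1, cent2, cnt2)
print(dist)
```

No global codebook is involved: both descriptors carry their own centroids, and the cost is dominated by the k x k table of centroid distances.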
AKULA implementation in TM7
Inner loop aggregation
• Dimension is fixed at 8
• Number of clusters nc = 8, 16, 32, to hit 64, 128, and 256 bytes
• Quantization: scale by ½ and quantize to int8; the SIFT count is 8 bits; total (nc+1)*dim bytes per aggregation
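The size arithmetic can be spelled out directly. The sketch below only evaluates the two expressions given on the slide for dim = 8: nc*dim for the nominal sizes and (nc+1)*dim for the stated total; the variable names are illustrative.

```python
DIM = 8  # subspace dimension, fixed at 8

sizes = {}
for nc in (8, 16, 32):
    centroid_bytes = nc * DIM      # one int8 per centroid coordinate
    total_bytes = (nc + 1) * DIM   # total per aggregation, as stated on the slide
    sizes[nc] = (centroid_bytes, total_bytes)

for nc, (centroid_bytes, total_bytes) in sizes.items():
    print(nc, centroid_bytes, total_bytes)
```

The nominal 64, 128, and 256 byte operating points correspond to the centroid part alone; the counts add a small overhead on top.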
AKULA implementation in TM7
Outer loop subspace optimization by boosting
• Initial set of subspace models {Ak} computed from SIFT extractions on the MIR FLICKR data set, by k-means clustering of the space into 4096 clusters
• Iterative search on subspaces to generate AKULA aggregations that can improve precision-recall performance
• Notice that aggregation is de-coupled in the subspace iteration, to allow more DoF in aggregation and to find subspaces that provide complementary info.
The algorithm is still being debugged, hence only 1st iteration results are available in TM7
Indexing/hashing is required for AKULA: pairwise matching currently involves nc x dim multiplications and additions. A binarization scheme will be considered once performance is optimized in non-binary form.
GD Only TPR-FPR: AKULA vs SCFV
Data set 1:
• AKULA (128 bytes, dim=8, nc=16) distance is just the 1-way dmin1.*wt
• SCFV (512 bytes): a weighted sum is forced on the Hamming distances without 2D decision fitting, i.e., count the Hamming distance between common active clusters and sum up their distances
GD Only TPR-FPR: AKULA vs SCFV
Data sets 2, 3:
• AKULA distance is just the 1-way dmin1.*wt
• AKULA = 128 bytes, SCFV = 512 bytes.
3D object sets: 4, 5
Data sets 4, 5:
AKULA in PM
FPR performance:
AKULA rates:
  PM rate    m     AKULA rate
  512        8     64
  1K         16    128
  2K         16    128
  1K_4K      16    128
  2K_4K      16    128
  4K         16    128
  8K         32    256
  16K        32    256
TPR@1% FPR
[Figure: eight bar charts of TPR (%) at 1% FPR, TM7 vs. AKULA, over data sets 1a, 1b, 1c, 2, 3, 4, 5; one panel per bitrate: 512, 1k, 2k, 1k-4k, 2k-4k, 4k, 8k, 16k]
AKULA Localization
Notable improvement in localization accuracy: 2.7%
AKULA Summary
Benefits:
• Allows more DoF in aggregation optimization,
  o by an outer loop boosting scheme for subspace projection optimization
  o and an inner loop adaptive clustering without the constraint of the global GMM model
• Simple weighted distance sum metric, with no need to tune a multi-dimensional decision boundary
• The overall pairwise matching performance matches TM7 SCFV with its 2-dimensional decision boundary
• GD-only matching outperforms the TM7 GD
• Good improvements to the localization accuracy
• Light in extraction, but still heavy in pairwise matching; needs a binarization scheme and/or indexing scheme to work for retrieval
Future Improvements:
• Supervector AKULA?
Lec 08 Summary
• Fisher Vector
  o Aggregate features {Xk} in $R^D$ against a GMM
• Super Vector
  o Aggregate GMM against a global GMM (UBM)
• AKULA
  o Direct Aggregation
