3. Precision, Recall, F-measure
• Precision = TP/(TP + FP)
• Recall (TPR) = TP/(TP + FN)
• FPR = FP/(FP + TN)
• F-measure = 2·(Precision·Recall)/(Precision + Recall)
Precision: the probability that a retrieved document is relevant.
Recall: the probability that a relevant document is retrieved in a search.
Z. Li, Image Analysis&Retrv.2016 p.3
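The formulas above can be sketched directly; the counts tp/fp/fn/tn below are hypothetical values for one retrieval run, used only for illustration.

```python
def precision(tp, fp):
    # Probability that a retrieved document is relevant
    return tp / (tp + fp)

def recall(tp, fn):
    # Probability that a relevant document is retrieved (TPR)
    return tp / (tp + fn)

def fpr(fp, tn):
    # False positive rate
    return fp / (fp + tn)

def f_measure(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 8, 2, 4, 86
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r), fpr(fp, tn))
```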
4. Why Aggregation?
• Curse of dimensionality
• Decision boundary / indexing
5. Bag-of-Words: Histogram Coding
• Codebook:
  - Feature space R^d; k-means gives k centroids {μ1, μ2, …, μk}
• BoW hard encoding:
  - For n feature points {x1, x2, …, xn}: a k×n assignment matrix with only one non-zero entry per column
  - Aggregated dimension: k
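A minimal numpy sketch of the hard encoding above; the feature and codebook shapes (n=100 points in R^8, k=4 centroids) are stand-ins for illustration, not values from the slides.

```python
import numpy as np

def bow_encode(features, codebook):
    """BoW hard encoding: k-bin histogram of nearest-centroid assignments."""
    n = features.shape[0]
    k = codebook.shape[0]
    # squared distances (n x k), nearest centroid per feature point
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)
    # k x n assignment matrix with exactly one non-zero entry per column
    A = np.zeros((k, n))
    A[nn, np.arange(n)] = 1.0
    # aggregated k-dimensional histogram
    return A.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))   # n=100 feature points in R^8 (stand-in)
C = rng.normal(size=(4, 8))     # k=4 centroids (stand-in for k-means output)
h = bow_encode(X, C)
print(h.shape, h.sum())         # 4 bins; counts sum to n = 100
```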
7. VLAD - Vector of Locally Aggregated Descriptors
• Aggregate feature differences from the codebook
  - Hard assignment: find the NN of each feature {xk} among the centroids {μi}
  - Compute the aggregated differences
  - L2 normalize
  - Final feature: k × d
[Figure: Voronoi cells of centroids μ1…μ5, with descriptors x and aggregated residuals v1…v5]
① assign descriptors to their nearest centroid
② compute x − μi
③ vi = Σ (x − μi) over cell i

v_i = Σ_{j : NN(x_j) = μ_i} (x_j − μ_i),   v_i = v_i / ||v_i||_2
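The three steps above can be sketched in numpy as follows, with per-cell L2 normalization as in the formula; the data and codebook shapes are illustrative stand-ins.

```python
import numpy as np

def vlad_encode(features, centroids):
    """VLAD: aggregate residuals x - mu_i per Voronoi cell, then L2-normalize."""
    k, d = centroids.shape
    # (1) hard-assign each feature to its nearest centroid
    nn = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    # (2) + (3) accumulate residuals x - mu_i per cell
    V = np.zeros((k, d))
    for i in range(k):
        if (nn == i).any():
            V[i] = (features[nn == i] - centroids[i]).sum(axis=0)
    # per-cell L2 normalization: v_i <- v_i / ||v_i||_2 (empty cells stay zero)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.where(norms > 0, norms, 1.0)   # final feature: k x d

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))   # stand-in descriptors
C = rng.normal(size=(5, 8))     # stand-in codebook, k=5
V = vlad_encode(X, C)
print(V.shape)                  # (5, 8)
```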
8. VLAD on SIFT
• Example of aggregating SIFT with VLAD
  - K = 16 codebook entries
  - Each cell visualizes one SIFT code: codebook centroid in blue, encoded VLAD difference in red
  - Top row: left image; bottom row: right image
10. One more trick
• Recall that SIFT is a powerful descriptor
• VL_FEAT: vl_dsift
  - A dense description of the image: SIFT descriptors computed on a predetermined grid (no spatial scale-space extrema detection)
  - Supplements HoG as an alternative texture descriptor
11. VL_FEAT: vl_dsift
• Compute dense SIFT as a texture descriptor for the image
  - [f, dsift] = vl_dsift(single(rgb2gray(im)), 'step', 2);
• There is also a FAST option
  - [f, dsift] = vl_dsift(single(rgb2gray(im)), 'fast', 'step', 2);
  - A huge amount of SIFT data will be generated
12. Fisher Vector
• Fisher Vector and variations:
  - Winning in image classification
  - Winning in the MPEG object re-identification: SCFV (Scalable Coded Fisher Vector) in CDVS
13. Codebook: Gaussian Mixture Model (GMM)
• GMM is a generative model of the data
  - Assume each data point is generated from a mixture of K Gaussians with parameters {wk, μk, Σk}
x_i ~ Σ_{k=1}^{K} w_k N(μ_k, Σ_k)

N(x; μ_k, Σ_k) = 1 / ( (2π)^{d/2} |Σ_k|^{1/2} ) · exp( −(1/2) (x − μ_k)' Σ_k^{−1} (x − μ_k) )
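A numeric sketch of evaluating the mixture density above, restricted to diagonal covariances; the two-component parameters here are illustrative stand-ins, not a fitted model.

```python
import numpy as np

def gmm_pdf(x, w, mu, var):
    """Evaluate sum_k w_k N(x; mu_k, diag(var_k)) at a single point x."""
    d = x.shape[0]
    # per-component normalizer (2 pi)^(d/2) |Sigma_k|^(1/2)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(var.prod(axis=1))
    # per-component quadratic form (x - mu_k)' Sigma_k^-1 (x - mu_k)
    quad = (((x - mu) ** 2) / var).sum(axis=1)
    comp = np.exp(-0.5 * quad) / norm
    return float((w * comp).sum())

# Stand-in 2-component GMM in R^2
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
var = np.ones((2, 2))
print(gmm_pdf(np.zeros(2), w, mu, var))  # dominated by the component at the origin
```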
14. A bit of Theory: Fisher Kernel
• Encode the deviation from the generative model
  - Observed feature set {x1, x2, …, xn} in R^d, e.g., d = 128 for SIFT
  - How do these observations deviate from the given GMM model with parameters λ = {wk, μk, Σk}?
    i.e., how would the parameters (e.g., the means) have to move to best fit the observations?
[Figure: an observation X1 against GMM components μ1…μ4]
15. A bit of Theory: Fisher Kernel
• Score function w.r.t. the likelihood function u_λ(X)
  - G_X^λ = ∇_λ log u_λ(X): gradient of the log-likelihood
  - The dimension of the score function is m, the number of generative model parameters (three parameter sets {w, μ, Σ} for a GMM)
  - Given the observed data X, the score function indicates how the likelihood parameters (e.g., the means) should move to better fit the data
• Distance/deviation of two observations X, Y w.r.t. the generative model
  - Fisher information matrix (roughly, the covariance in a Mahalanobis distance):
    F_λ = E_X[ G_X^λ G_X^λ' ]
  - Fisher kernel distance, normalized by the Fisher information matrix:
    K_FK(X, Y) = G_X^λ' F_λ^{−1} G_Y^λ
16. Fisher Vector
• K_FK(X, Y) is a measure of similarity w.r.t. the generative model
  - As in the Mahalanobis distance case, we can decompose this kernel with F_λ^{−1} = L_λ' L_λ:
    K_FK(X, Y) = G_X^λ' F_λ^{−1} G_Y^λ = (L_λ G_X^λ)' (L_λ G_Y^λ)
  - This gives a kernel feature mapping of X to its Fisher Vector, 𝒢_X = L_λ G_X^λ
  - For observed image features {xt}, it can be computed in closed form
17. GMM Fisher Vector
• Encode the deviation from the generative model
  - Observed feature set {x1, x2, …, xn} in R^d, e.g., d = 128 for SIFT
  - How do these observations deviate from the given GMM model with parameters λ = {wk, μk, σk}?
• GMM log-likelihood gradient
  - Let γ_i(k) = w_k N(x_i; μ_k, σ_k) / Σ_j w_j N(x_i; μ_j, σ_j) be the posterior (soft assignment); then the gradients w.r.t. each component's weight, mean, and variance aggregate the γ-weighted statistics of {xi}
18. GMM Fisher Vector: VL_FEAT implementation
• GMM codebook
  - For a K-component GMM we only allow 3K parameter sets {wk, μk, σk}, k = 1..K, i.e., diagonal (per-dimension independent) Gaussian components:
    Σ_k = diag(σ_k,1^2, σ_k,2^2, …, σ_k,D^2)
• Posterior probability of feature point xi under GMM component k:
    q_ik = w_k N(x_i; μ_k, Σ_k) / Σ_j w_j N(x_i; μ_j, Σ_j)
19. GMM Fisher Vector: VL_FEAT implementation
• FV encoding
  - Gradient on the mean for GMM component k, dimension j = 1..D (and similarly on the variance)
  - In the end we have a 2K × D aggregation of the deviations w.r.t. the means and variances:
    FV = [u_1, u_2, …, u_K, v_1, v_2, …, v_K]
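The 2K×D encoding above can be sketched in numpy for a diagonal-covariance GMM. The normalization constants follow the common Perronnin-style FV formulation and may differ from VL_FEAT's exact scaling; all parameter values are stand-ins.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma2):
    """FV = [u_1..u_K, v_1..v_K]: posterior-weighted mean and variance gradients."""
    N, D = X.shape
    # log of w_k * N(x; mu_k, diag(sigma2_k)) for every point/component pair
    logp = -0.5 * (((X[:, None, :] - mu[None]) ** 2) / sigma2[None]).sum(axis=2)
    logp += np.log(w)[None, :] - 0.5 * np.log(2 * np.pi * sigma2).sum(axis=1)[None, :]
    # posteriors gamma_t(k), computed stably (N x K)
    g = np.exp(logp - logp.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)
    # whitened residuals (N x K x D)
    diff = (X[:, None, :] - mu[None]) / np.sqrt(sigma2)[None]
    # gradient w.r.t. means (u) and variances (v), K x D each
    u = (g[..., None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    v = (g[..., None] * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * w)[:, None])
    return np.concatenate([u.ravel(), v.ravel()])   # length 2 * K * D

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))           # stand-in features, d=4
w = np.array([0.5, 0.5])                # stand-in GMM, K=2
mu = rng.normal(size=(2, 4))
sigma2 = np.ones((2, 4))
fv = fisher_vector(X, w, mu, sigma2)
print(fv.shape)                         # (16,) = 2 * K * D
```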
20. VL_FEAT GMM/FV API
• Compute a GMM model with VL_FEAT
  - Prepare data:
    numPoints = 1000 ; dimension = 2 ;
    data = rand(dimension, numPoints) ;
  - Call vl_gmm:
    numClusters = 30 ;
    [means, covariances, priors] = vl_gmm(data, numClusters) ;
  - Visualize:
    figure ;
    hold on ;
    plot(data(1,:), data(2,:), 'r.') ;
    for i = 1:numClusters
      vl_plotframe([means(:,i)' covariances(1,i) 0 covariances(2,i)]);
    end
21. VL_FEAT API
• FV encoding
    encoding = vl_fisher(datatoBeEncoded, means, covariances, priors);
• Bonus points: encode HoG features with Fisher Vector?
  - Randomly collect 2-3 images from each class
  - Stack all HoG features together into an n × 36 data matrix
  - Compute its GMM
  - Use this GMM to encode all images' HoG features (instead of averaging)
22. Super Vector Aggregation - Speaker ID
• Fisher Vector: aggregates features against a GMM
• Super Vector: aggregates a GMM against a GMM
  - Ref: William M. Campbell, Douglas E. Sturim, Douglas A. Reynolds: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5): 308-311 (2006)
23. Super Vector from MFCC
• Motivated by Speaker ID work
  - Speech is a continuous evolution of the vocal tract
  - Need to extract a sequence of spectra or spectral coefficients
  - Use a sliding window: 25 ms window, 10 ms shift
[Pipeline: windowed frame → log|X(ω)| → DCT → MFCC]
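The sliding-window framing above (25 ms window, 10 ms shift) can be sketched as follows; the sample rate and signal are stand-ins, and a full MFCC pipeline would add |FFT|, a mel filter bank, log, and DCT per frame.

```python
import numpy as np

def frame_signal(x, fs, win_ms=25, hop_ms=10):
    """Slice signal x into overlapping frames: win_ms window, hop_ms shift."""
    win = int(fs * win_ms / 1000)   # window length in samples
    hop = int(fs * hop_ms / 1000)   # hop length in samples
    n = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

fs = 16000                                     # stand-in sample rate
x = np.random.default_rng(0).normal(size=fs)   # 1 s of stand-in signal
frames = frame_signal(x, fs)
print(frames.shape)  # (98, 400): 400-sample windows every 160 samples
```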
24. GMM Model from MFCC
• GMM on the MFCC features
  - The acoustic vectors (MFCC) of speaker s are modeled by a probability density function parameterized by λ^(s) = { p_j^(s), μ_j^(s), Σ_j^(s) }, j = 1..M
  - Gaussian mixture model (GMM) for speaker s:
    p(x | λ^(s)) = Σ_{j=1}^{M} p_j^(s) N(x | μ_j^(s), Σ_j^(s))
25. Universal Background Model
• UBM GMM model:
  - The acoustic vectors of a general population are modeled by another GMM, the universal background model (UBM):
    p(x | λ^ubm) = Σ_{j=1}^{M} p_j^ubm N(x | μ_j^ubm, Σ_j^ubm)
  - Parameters of the UBM: λ^ubm = { p_j^ubm, μ_j^ubm, Σ_j^ubm }, j = 1..M
26. MAP Adaptation
• Given the UBM GMM, how do new observations deviate from it?
  - With soft counts n_j = Σ_t γ_t(j) and data means E_j(x) = (1/n_j) Σ_t γ_t(j) x_t, the adapted mean is given by relevance-MAP adaptation with relevance factor r:
    μ̂_j = α_j E_j(x) + (1 − α_j) μ_j^ubm,   α_j = n_j / (n_j + r)
27. Supervector Distance
• Assume we have a UBM GMM model λ_UBM = {p_k, μ_k, Σ_k}, and that adapted models keep its priors and covariances
• Then for two utterance samples a and b, with GMM models
  - λ_a = { p_k, μ_k^a, Σ_k }
  - λ_b = { p_k, μ_k^b, Σ_k }
  the SV distance is
    K(λ_a, λ_b) = Σ_k ( √p_k Σ_k^{−1/2} μ_k^a )' ( √p_k Σ_k^{−1/2} μ_k^b )
• The means of the two models are normalized by the UBM-covariance-induced Mahalanobis metric
• This is also a linear kernel function scaled by the UBM priors and covariances
28. Supervector Performance in NIST Speaker ID
• System 5: Gaussian SV
  - DCF (Detection Cost Function)
29. m31491: AKULA - Adaptive KLUster Aggregation
2013/10/25
Abhishek Nagar, Zhu Li, Gaurav Srivastava and Kyungmo Park
31. Motivation
• Better aggregation
  - Fisher Vector and VLAD-type aggregation depend on a global model
  - AKULA removes this dependence and directly codes the cluster centroids and SIFT counts
  - SCFV/RVD both have situations where clusters are turned off due to no assignment; this is avoided in AKULA
[Pipeline: SIFT detection & selection → k-means → AKULA description]
32. Motivation
• Better subspace choice
  - Both SCFV and RVD do fixed normalization and PCA projection based on heuristics
  - What is the best possible subspace in which to aggregate?
  - Use a boosting scheme to keep adding subspaces and aggregations iteratively, tuning TPR-FPR to the desired operating points on FPR
33. CE2: AKULA - Adaptive KLUster Aggregation
• AKULA descriptor: cluster centroids + SIFT counts
    A1 = { yc1_1, yc1_2, …, yc1_k ; pc1_1, pc1_2, …, pc1_k },
    A2 = { yc2_1, yc2_2, …, yc2_k ; pc2_1, pc2_2, …, pc2_k }
• Distance metric: min centroid distance, weighted by SIFT count
    d(A1, A2) = (1/k) Σ_i d_min1(i) w_min1(i) + (1/k) Σ_j d_min2(j) w_min2(j)
  where
    d_min1(i) = min_j d_{i,j},   w_min1(i) = w_{i,j*},   j* = argmin_j d_{i,j}
    d_min2(j) = min_i d_{i,j},   w_min2(j) = w_{i*,j},   i* = argmin_i d_{i,j}
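The two-way weighted min-distance above can be sketched in numpy. The slides do not fully specify the weight w_{i,j}; here it is assumed, for illustration only, to be the product of the matched clusters' SIFT counts.

```python
import numpy as np

def akula_distance(Y1, p1, Y2, p2):
    """Two-way min-centroid distance, weighted by SIFT counts.

    Weight w_{i,j} is assumed to be p1_i * p2_j (an assumption, not from the slides).
    """
    k = Y1.shape[0]
    # k x k centroid distance matrix
    D = np.linalg.norm(Y1[:, None, :] - Y2[None, :, :], axis=2)
    j_star = D.argmin(axis=1)        # NN in A2 of each A1 centroid
    i_star = D.argmin(axis=0)        # NN in A1 of each A2 centroid
    w1 = p1 * p2[j_star]             # assumed weights, A1 -> A2 direction
    w2 = p1[i_star] * p2             # assumed weights, A2 -> A1 direction
    return (D[np.arange(k), j_star] * w1).mean() + (D[i_star, np.arange(k)] * w2).mean()

rng = np.random.default_rng(3)
Y = rng.normal(size=(8, 8))                      # nc=8 centroids, dim=8 subspace
p = rng.integers(1, 10, size=8).astype(float)    # stand-in SIFT counts
print(akula_distance(Y, p, Y, p))                # identical descriptors -> 0.0
```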
34. AKULA implementation in TM7
• Inner-loop aggregation
  - Dimension is fixed at 8
  - Number of clusters nc = 8, 16, 32, to hit 64, 128, and 256 bytes
  - Quantization: scale by 1/2 and quantize to int8; the SIFT count is 8 bits; total (nc+1)·dim bytes per aggregation
35. AKULA implementation in TM7
• Outer-loop subspace optimization by boosting
  - An initial set of subspace models {Ak} is computed from MIR FLICKR data set SIFT extractions by k-means clustering the space into 4096 clusters
  - Iterative search over subspaces generates AKULA aggregations that improve precision-recall performance
  - Notice that aggregation is de-coupled from the subspace iteration, allowing more DoF in aggregation, to find subspaces that provide complementary info
• The algorithm is still being debugged, hence only 1st-iteration results in TM7
36. AKULA implementation in TM7
• Indexing/hashing is required for AKULA: matching currently involves nc × dim multiplications and additions. A binarization scheme will be considered once its performance is optimized in non-binary form.
37. GD Only TPR-FPR: AKULA vs SCFV
• Data set 1:
  - AKULA (128 bytes, dim=8, nc=16) distance is just the 1-way dmin1.*wt
  - Forcing a weighted sum on SCFV (512 bytes) Hamming distances without 2D decision fitting, i.e., counting the Hamming distance between common active clusters and summing the distances
38. GD Only TPR-FPR: AKULA vs SCFV
• Data sets 2, 3:
  - AKULA distance is just the 1-way dmin1.*wt
  - AKULA = 128 bytes, SCFV = 512 bytes
39. 3D object sets: 4, 5
• Data sets 4, 5:
46. AKULA Summary
• Benefits:
  - Allows more DoF in aggregation optimization,
    - by an outer-loop boosting scheme for subspace projection optimization,
    - and an inner-loop adaptive clustering without the constraint of the global GMM model
  - Simple weighted distance-sum metric, with no need to tune a multi-dimensional decision boundary
  - Overall pairwise matching matched up with TM7 SCFV and its 2-dimensional decision boundary
  - In GD-only matching it outperforms the TM7 GD
  - Good improvements to the localization accuracy
  - Light in extraction, but still heavy in pairwise matching; needs a binarization and/or indexing scheme to work for retrieval
• Future improvements:
  - Supervector AKULA?
47. Lec 08 Summary
• Fisher Vector
  - Aggregate features {xk} in R^D against a GMM
• Super Vector
  - Aggregate a GMM against a global GMM (UBM)
• AKULA
  - Direct aggregation