This document analyzes identification systems for large-scale databases using information theory. It presents three key contributions:
1. It proposes a list-based decoder to improve identification performance while addressing issues of search and memory complexity.
2. It analyzes the trade-off between identification rate, search complexity, and memory complexity. Clustering the database is suggested to reduce search complexity.
3. It introduces active content fingerprinting (aCFP), which marries passive fingerprinting and watermarking techniques.
The document outlines these topics and provides statistical analysis of digital fingerprints extracted from correlated data. Performance of the proposed list decoder is also evaluated numerically.
Information-Theoretic Analysis of ID Systems in Large Databases
Information-Theoretic Analysis of Identification Systems in Large-Scale Databases
Farzad Farhadzadeh
Computer Science Department, University of Geneva, Switzerland
January 15, 2014
Outline
Introduction
Fingerprint Statistics and Constrained List-Based Decoder
Identification Rate, Search and Memory Complexity Trade-off
Active Content Fingerprinting
Conclusions and Future Work
Introduction
Motivation
Identification targets
◦ Physical objects (jewelry, packaging)
◦ Humans (biometrics)
◦ Digital contents
Main concerns
◦ High dimensional data
◦ Highly correlated data
◦ Performance
◦ Search complexity
◦ Memory complexity
Identification Setup
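[Figure: identification setup. The source emits an index $W$; the selector outputs the database entry $X^N(W)$; the observation channel produces $Y^N$; the decoder compares $Y^N$ with the database entries $X^N(1), \ldots, X^N(M)$ and outputs the estimate $\hat{W}$.]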
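An identification rate $R$ is called achievable if, for any $\delta > 0$ and all sufficiently large $N$, there exist decoders such that
$$\frac{1}{N}\log_2 M \ge R - \delta, \qquad P_E \le \delta.$$
Error probability:
$$P_E \triangleq \frac{1}{M}\sum_{w=1}^{M}\Pr\{\hat{W} \ne w \mid W = w\}$$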
Theorem
The capacity of an identification system, $C_{id}$, defined as the supremum of all achievable rates, is given by [Willems et al.(2003)]
$$C_{id} = I(X; Y),$$
where $P(x, y) = Q_s(x)\,Q_c(y \mid x)$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$.
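As an illustration of the capacity expression, the following minimal sketch evaluates $I(X;Y)$ for a discrete source and channel. The function name `mutual_information` and the binary example (uniform source, BSC with cross-over probability 0.1) are assumptions chosen for illustration; the slides do not prescribe an implementation.

```python
import math


def mutual_information(q_s, q_c):
    """I(X;Y) in bits for a source pmf q_s[x] and a channel matrix q_c[x][y]."""
    n_y = len(q_c[0])
    # Output marginal P(y) = sum_x Qs(x) Qc(y|x).
    p_y = [sum(q_s[x] * q_c[x][y] for x in range(len(q_s))) for y in range(n_y)]
    mi = 0.0
    for x, p_x in enumerate(q_s):
        for y in range(n_y):
            p_xy = p_x * q_c[x][y]
            if p_xy > 0.0:
                mi += p_xy * math.log2(p_xy / (p_x * p_y[y]))
    return mi


# Illustrative binary case: uniform source observed through a BSC(q).
q = 0.1
c_id = mutual_information([0.5, 0.5], [[1 - q, q], [q, 1 - q]])
print(f"C_id = I(X;Y) = {c_id:.4f} bits")
```

For this binary example the result coincides with $1 - H_2(q)$, the familiar BSC expression.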
Main Contributions
Content identification (passive and active):
◦ Digital fingerprint: to address search-memory complexity issues
◦ List decoder: to improve the identification performance
◦ $R_{id}$, $S_e$, $M_e$ trade-off: identification rate, search and memory complexity trade-off
◦ aCFP: marriage of passive fingerprinting and watermarking
Fingerprint Statistics and Constrained List-Based Decoder
Identification Setup
Content Identification Based on Binary Fingerprints
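[Figure: enrollment and identification. Enrollment: the fingerprint extraction $\psi(\cdot)$ maps each $X^N(m)$ to a binary fingerprint $\bar{X}^L(m)$, $m = 1, \ldots, M$, stored in the database. Identification: $X^N(W)$ passes through the acquisition channel $P(Y^N \mid X^N)$; $\psi(\cdot)$ extracts $\bar{Y}^L$ from $Y^N$, and the decoder returns a list of candidates $\mathcal{N}_l$.]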
Definition
Digital fingerprint: robust, short and discriminative content representation
Statistical Analysis of Digital Fingerprint
Digital Fingerprint from Correlated Data
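Source model (first-order autoregressive process):
$$X^N \sim \mathcal{N}(0^N, K_{xx}), \qquad X_n = \rho X_{n-1} + \Xi_n, \qquad \Xi_n \sim \mathcal{N}\big(0, (1-\rho^2)\sigma_X^2\big)$$
Fingerprint extraction $\psi(\cdot)$, with the projection matrix $W$ generated from the key $k$:
$$\tilde{X}^L = W^{\dagger} X^N \sim \mathcal{N}(0^L, K_{\tilde{x}\tilde{x}}), \qquad \bar{X}^L = \mathrm{sign}(\tilde{X}^L) \in \{0, 1\}^L$$
$$W \in \{\pm 1/\sqrt{N}\}^{N \times L}, \qquad W_{ij} \sim \text{Bernoulli}(0.5)$$
[Figure: $X^N$ is projected by $W$ onto $\tilde{X}^L$; $\mathrm{sign}(\tilde{X})$ maps each projected coefficient to 1 or 0.]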
Digital Fingerprint from Correlated Data [Farhadzadeh et al.(2011)]
Proposition
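Off-diagonal and diagonal elements of $K_{\tilde{x}\tilde{x}}$ can be bounded as follows:
$$\Pr\Big\{\max_{i \ne j}\big|K^{ij}_{\tilde{x}\tilde{x}}\big| > \beta\sigma_X^2\Big\} < \frac{1}{L} \qquad \text{(off-diagonal elements)}$$
$$\Pr\Big\{\max_{i}\big|K^{ii}_{\tilde{x}\tilde{x}} - \sigma_X^2\big| > \alpha\sigma_X^2\Big\} < \frac{2}{L^{1/\rho}} \qquad \text{(diagonal elements)}$$
where $\beta = \frac{1-\rho^N}{1-\rho}\,\frac{12}{N}\ln L$ and $\alpha = \frac{1-\rho^{N-1}}{1-\rho}\,\frac{8}{N}\,\rho\ln L$.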
Remark
For sufficiently large $N$ and $L$ with $L \le N$: $\beta \to 0$ and $\alpha \to 0$, so $K_{\tilde{x}\tilde{x}}$ converges to $\sigma_X^2 I_L$ with high probability.
Conclusion
Gaussian: uncorrelated ⇒ independent ⇒ $\bar{X}^L \sim$ i.i.d. Bernoulli(1/2).
Statistics of Query Fingerprint
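Observation model (additive white Gaussian noise, AWGN):
$$Y^N = X^N + Z^N, \qquad X^N \in \mathbb{R}^N, \qquad Z^N \sim \mathcal{N}(0^N, \sigma_Z^2 I_N)$$
Query fingerprint, extracted by $\psi(\cdot)$ with the same key $k$ and projection $W$:
$$\bar{Y}^L = \mathrm{sign}(\tilde{Y}^L) = \mathrm{sign}(W^{\dagger}Y^N) = \mathrm{sign}(W^{\dagger}X^N + W^{\dagger}Z^N), \qquad \tilde{Y}^L \in \mathbb{R}^L$$
Conclusion
Gaussian: uncorrelated ⇒ independent ⇒ $\bar{Y}^L \sim$ i.i.d. Bernoulli(1/2).
The AWGN channel between $X^N$ and $Y^N$ becomes a Binary Symmetric Channel (BSC) between $\bar{X}^L$ and $\bar{Y}^L$, with transition probabilities $1 - P_b$ (bit preserved) and $P_b$ (bit flipped), where
$$P_b = \frac{1}{\pi}\arctan\frac{\sigma_Z}{\sigma_X}$$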
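The AWGN-to-BSC equivalence can be checked empirically. The sketch below is a Monte Carlo experiment under simplifying assumptions that are not taken from the slides: the content is i.i.d. Gaussian (i.e. $\rho = 0$), the dimensions `n`, `l`, the number of items and the noise level are arbitrary illustrative values, and the function names are hypothetical.

```python
import math
import random


def estimate_bit_error(n=128, l=16, items=500, sigma_x=1.0, sigma_z=0.5, seed=0):
    """Monte Carlo estimate of Pr{sign(W'x) != sign(W'y)} for y = x + z."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n)
    # Columns of W with entries +-1/sqrt(N), signs equiprobable.
    columns = [[scale if rng.random() < 0.5 else -scale for _ in range(n)]
               for _ in range(l)]
    flips = 0
    for _ in range(items):
        x = [rng.gauss(0.0, sigma_x) for _ in range(n)]   # i.i.d. content (assumed rho = 0)
        y = [xi + rng.gauss(0.0, sigma_z) for xi in x]    # AWGN observation
        for col in columns:
            x_proj = sum(c * xi for c, xi in zip(col, x))
            y_proj = sum(c * yi for c, yi in zip(col, y))
            flips += (x_proj > 0.0) != (y_proj > 0.0)
    return flips / (items * l)


def theoretical_bit_error(sigma_x, sigma_z):
    """Cross-over probability P_b = arctan(sigma_Z / sigma_X) / pi."""
    return math.atan(sigma_z / sigma_x) / math.pi


p_hat = estimate_bit_error()
p_theory = theoretical_bit_error(1.0, 0.5)
print(f"estimated P_b = {p_hat:.4f}, arctan formula = {p_theory:.4f}")
```

Because each column of $W$ has unit norm, the projected noise keeps variance $\sigma_Z^2$, which is why the ratio $\sigma_Z/\sigma_X$ carries over unchanged to the projected domain.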
Performance Analysis
Constrained List-Based Decoder [Farhadzadeh et al.(2010)]
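Binary case
[Figure: database entries ordered by Hamming distance to the query, $d_H(\bar{y}^L, \bar{x}^L(r_1)) \le \cdots \le d_H(\bar{y}^L, \bar{x}^L(r_{N_l})) \le \cdots \le d_H(\bar{y}^L, \bar{x}^L(r_M))$, where $r_j$ is the index of the $j$-th closest entry. Among the $N_l$ closest entries, those with $d_H \le \eta L$ form the candidate list $\mathcal{N}_l$.]
Probability of miss
$$P_m = 1 - P_{ci} = \frac{1}{M}\sum_{w=1}^{M}\Pr\{(w \notin \mathcal{N}_l) \cup (D_w > \eta L) \mid H_w\},$$
where $P_{ci}$ is the probability of correct identification and $D_w$ is the Hamming distance between the query and entry $w$.
Probability of false acceptance
$$P_{fa} = \Pr\{\mathcal{N}_l \ne \emptyset \mid H_0\}$$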
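The decision rule can be sketched in a few lines. This is a minimal illustration, assuming fingerprints are stored as lists of bits and that an empty list means rejection; the function names and the toy database are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(x != y for x, y in zip(a, b))


def constrained_list_decode(query, database, n_l, eta):
    """Indices of the (at most) n_l closest entries whose distance is <= eta * L."""
    threshold = eta * len(query)
    ranked = sorted(range(len(database)), key=lambda w: hamming(query, database[w]))
    return [w for w in ranked[:n_l] if hamming(query, database[w]) <= threshold]


database = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
noisy_query = [1, 1, 1, 0, 0, 0, 0, 0]      # entry 1 with one flipped bit
unrelated_query = [1, 0, 1, 0, 1, 0, 1, 0]  # far from every entry

accepted = constrained_list_decode(noisy_query, database, n_l=2, eta=0.25)
rejected = constrained_list_decode(unrelated_query, database, n_l=2, eta=0.25)
print(accepted, rejected)
```

A miss corresponds to the true index being absent from the returned list; a false acceptance corresponds to a non-empty list for a query that is unrelated to the database.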
Probability of False Acceptance [Farhadzadeh et al.(2011)]
$$P_{fa} = \Pr\Big\{\bigcup_{w=1}^{M}\{D_w \le \eta L\} \,\Big|\, H_0\Big\}$$
Proposition
For a binary database with $M = e^{LR}$ entries of length $L$, the probability of false acceptance of the constrained list-based decoder, $P_{fa}$, for any $P_b < \eta < \frac{1}{2}$, satisfies
$$P_{fa} \le \exp[-L(\ln 2 - H_2(\eta) - R)],$$
where $H_2(\eta) = -\eta\log\eta - (1-\eta)\log(1-\eta)$.
Remarks
◦ $P_{fa}$ is the same as for the unique decoder.
◦ If $R < \ln 2 - H_2(\eta)$, then $L \to \infty$ implies $P_{fa} \to 0$.
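To get a feel for the false-acceptance bound, the sketch below evaluates it numerically. It assumes that $\log$ in $H_2$ is the natural logarithm, which is consistent with $\ln 2$ and $M = e^{LR}$ in the proposition; the parameter values are illustrative only.

```python
import math


def h2_nats(p):
    """Binary entropy in nats (natural logarithm assumed)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def pfa_bound(length, rate, eta):
    """Upper bound exp[-L (ln 2 - H2(eta) - R)] on the false-acceptance probability."""
    return math.exp(-length * (math.log(2.0) - h2_nats(eta) - rate))


eta, rate = 0.2, 0.1
for length in (32, 64, 128, 256):
    print(length, pfa_bound(length, rate, eta))
```

The bound decays with $L$ only when $R < \ln 2 - H_2(\eta)$; for larger rates the expression exceeds 1 and is vacuous.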
Probability of Miss [Farhadzadeh et al.(2011)]
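$$P_m = \Pr\{(1 \notin \mathcal{N}_l) \cup (D_1 > \eta L) \mid H_1\} = \underbrace{\Pr\{(1 \notin \mathcal{N}_l) \cap (D_1 \le \eta L) \mid H_1\}}_{P_m^{I}} + \underbrace{\Pr\{D_1 > \eta L \mid H_1\}}_{P_m^{II}}$$
Proposition
The probability of miss of the constrained list-based decoder, $P_m$, for any $P_b < \eta < \frac{1}{2}$, satisfies
$$P_m \le \big(\exp[-L(\ln 2 - H_2(\eta) - R)]\big)^{N_l} + \exp[-L\,D(\eta \| P_b)],$$
where $D(\eta \| P_b) = \eta\log\frac{\eta}{P_b} + (1-\eta)\log\frac{1-\eta}{1-P_b}$.
Remarks
◦ For $N_l > 1$ the first kind of miss probability, $P_m^{I}$, decays faster.
◦ If $R < \ln 2 - H_2(\eta)$, then $L \to \infty$ implies $P_m \to 0$.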
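The effect of the list size on the miss bound can be illustrated numerically. The sketch assumes natural logarithms throughout (consistent with $\ln 2$ in the bound); the parameter values and function names are illustrative assumptions.

```python
import math


def h2_nats(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def kl_nats(eta, p_b):
    """Binary divergence D(eta || P_b) in nats."""
    return (eta * math.log(eta / p_b)
            + (1.0 - eta) * math.log((1.0 - eta) / (1.0 - p_b)))


def pm_bound(length, rate, eta, p_b, n_l):
    """Bound on P_m: (exp[-L(ln 2 - H2(eta) - R)])^N_l + exp[-L D(eta || P_b)]."""
    first = math.exp(-length * (math.log(2.0) - h2_nats(eta) - rate)) ** n_l
    second = math.exp(-length * kl_nats(eta, p_b))
    return first + second


length, rate, eta, p_b = 64, 0.1, 0.2, 0.1
for n_l in (1, 2, 4):
    print(n_l, pm_bound(length, rate, eta, p_b, n_l))
```

Only the first term depends on $N_l$; the second term, which accounts for the true entry falling outside the radius $\eta L$, is unaffected by the list size.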
Numerical Evaluation
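Feature Extraction
[Figure: each image is split into 16 × 16 blocks; a DCT is applied to every block, and the DCT coefficient at position (1, 2) of each block is collected into the feature vector $X^N$.]
Data Statistics
| Domain | Dimension | Statistics |
| Feature domain | N = 768 | $\rho = 0.41$ |
| RP domain | L = 32 | $\max_{i \ne j} K^{ij}_{\tilde{x}\tilde{x}} = 0.08$, $\theta_{\tilde{X}} = 1.74$ |
| Binary domain | L = 32 | $\max_{i \ne j} K^{ij}_{\bar{x}\bar{x}} = 0.07$, $\hat{P} = 0.5$ |
Remark
Projected data are approximately uncorrelated and follow a Gaussian distribution.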
Data Statistics
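Columns 3-5: feature domain; columns 6-8: RP domain; columns 9-12: binary domain.
| Distortion model | Parameter | $\max_{i \ne j} K^{ij}_{zz}$ | $\theta_Z$ | $\max K_{zx}$ | $\max_{i \ne j} K^{ij}_{\tilde{z}\tilde{z}}$ | $\theta_{\tilde{Z}}$ | $\max K_{\tilde{z}\tilde{x}}$ | $\max_{i \ne j} K^{ij}_{\bar{z}\bar{z}}$ | $P_b$ | $\hat{P}_b$ | $\max K_{\bar{z}\bar{x}}$ |
| AWGN | PSNR = 5 dB | 0.03 | 2 | 0.03 | 0.08 | 2 | 0.09 | 0.08 | 0.20 | 0.21 | 0.09 |
| AWGN | PSNR = 10 dB | 0.03 | 2 | 0.03 | 0.07 | 2 | 0.07 | 0.07 | 0.12 | 0.13 | 0.08 |
| AWGN | PSNR = 15 dB | 0.03 | 2 | 0.03 | 0.08 | 2 | 0.08 | 0.08 | 0.07 | 0.08 | 0.09 |
| AWGN | PSNR = 20 dB | 0.03 | 2 | 0.03 | 0.08 | 2 | 0.09 | 0.08 | 0.04 | 0.05 | 0.09 |
| JPEG | QF = 1 | 0.04 | 1.2 | 0.09 | 0.10 | 1.93 | 0.10 | 0.10 | 0.10 | 0.11 | 0.09 |
| JPEG | QF = 10 | 0.04 | 1.8 | 0.05 | 0.07 | 1.96 | 0.08 | 0.08 | 0.03 | 0.04 | 0.09 |
| JPEG | QF = 25 | 0.03 | 1.95 | 0.06 | 0.06 | 1.99 | 0.09 | 0.07 | 0.01 | 0.01 | 0.09 |
| Histeq | | 0.15 | 0.75 | 0.49 | 0.20 | 1.19 | 0.30 | 0.12 | 0.14 | 0.1 | 0.09 |
Remark
The noise approximately follows a Gaussian distribution and is independent of the content.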
Identification Performance
[Figure: $P_{ci}$ (vertical axis, 0 to 1) versus $P_{fa}$ (horizontal axis, 0 to 1) for four settings: PSNR = 5, $N_l = 1$; PSNR = 20, $N_l = 1$; PSNR = 5, $N_l = 2$; PSNR = 5, $N_l = 4$.]
Remark
The list-based decoder improves the performance in a certain range of list sizes.
Identification Rate, Search and Memory Complexity Trade-off
Introduction
Search Complexity Reduction
[Figure: identification setup with source $W$, selector output $X^N(W)$, observation channel output $Y^N$, database $X^N(1), \ldots, X^N(M)$, and decoder output $\hat{W}$.]
Search complexity: the decoder has to check all $x^N(w)$, $1 \le w \le M$, exhaustively to find the best match.
Question: Can we speed up this process?
Idea: Cluster the database.
To do clustering: assign the database entries to clusters
◦ disjoint clusters
◦ overlapped clusters
To find the best match:
◦ single cluster estimation
◦ multiple cluster estimation
[Figure: the query $y^N$ is compared with the cluster centroids $u^N(1), u^N(2), \ldots, u^N(w_1), \ldots, u^N(M_1)$; the database entries $x^N(1), \ldots, x^N(M)$ are grouped under the centroids, and the entries of the estimated clusters (here $u^N(1)$ and $u^N(2)$) are then checked.]
Question: What is the fundamental trade-off between the number of
cluster-checks and refinement checks?
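The two kinds of checks can be made concrete with a small sketch of a clustered search. It assumes disjoint clusters, fixed centroids given in advance, and Hamming distance on bit strings; the function names, the centroids and the toy database are hypothetical and not taken from the slides.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def build_clusters(database, centroids):
    """Disjoint clustering: assign every entry to its nearest centroid."""
    clusters = {c: [] for c in range(len(centroids))}
    for w, entry in enumerate(database):
        nearest = min(range(len(centroids)), key=lambda c: hamming(entry, centroids[c]))
        clusters[nearest].append(w)
    return clusters


def two_stage_search(query, database, centroids, clusters, m3):
    """Check all centroids, then refine inside the m3 closest clusters."""
    ranked = sorted(range(len(centroids)), key=lambda c: hamming(query, centroids[c]))
    candidates = [w for c in ranked[:m3] for w in clusters[c]]
    checks = len(centroids) + len(candidates)  # cluster checks + refinement checks
    if not candidates:
        return None, checks
    best = min(candidates, key=lambda w: hamming(query, database[w]))
    return best, checks


centroids = ["00000000", "11111111", "00001111"]
database = ["00000001", "10000000", "11111110", "01111111", "00001110", "00000111"]
clusters = build_clusters(database, centroids)

query = "01110111"  # entry 3 with one flipped bit
best, checks = two_stage_search(query, database, centroids, clusters, m3=1)
print(best, checks, "checks versus", len(database), "for exhaustive search")
```

Increasing `m3` (multiple cluster estimation) raises the number of refinement checks but lowers the risk that the correct entry sits in a cluster that was never inspected.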
Generalized Model Description and Statement of result
Generalized Two-Stage Decoding [Farhadzadeh et al.(2013b)]
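[Figure: generalized two-stage decoding. The source emits $W$; the selector outputs $X^N(W)$ from the database $X^N(1), \ldots, X^N(M)$; the observation channel produces $Y^N$; the first decoder outputs $W_1(1), \ldots, W_1(M_3)$; the second decoder outputs $\hat{W}_1$ and $\hat{W}_2$; the combiner outputs $\hat{W}$.]
◦ The first decoder determines, from $Y^N$, the list $W_1 = (W_1(1), \ldots, W_1(M_3))$ and sends it to the second decoder.
◦ The second decoder determines, from $Y^N$ and $W_1$, the estimates $\hat{W}_1$ and $\hat{W}_2$ and sends them to the combiner.
◦ The combiner determines the index $\hat{W}$.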
Fundamental Trade-off [Farhadzadeh et al.(2013b)]
Theorem
The region of achievable rate quadruples of the identification system is given by
$$\big\{(R_1, R_2, R_3, R) : \; R_1 \ge I(X, Y; U), \;\; R_2 \ge \max(0, R - I(X; U)), \;\; R_3 \ge I(X; U \mid Y), \;\; 0 \le R \le C_{id} = I(X; Y),$$
$$\text{for } P(x, y, u) = Q_s(x)\,Q_c(y \mid x)\,P(u \mid x, y), \text{ where } |\mathcal{U}| \le |\mathcal{Y}| \cdot |\mathcal{X}| + 2\big\}.$$
Remark
Since $P(x, y, u) = Q_s(x)\,Q_c(y \mid x)\,P(u \mid x, y)$, the auxiliary random variable $U$ depends on both $X$ and $Y$.
Achievable Search-Clustering Schemes
| Decoding | Disjoint clustering | Overlapped clustering |
| single | × | X ↔ Y ↔ U |
| multiple | U ↔ X ↔ Y | General |
◦ X ↔ Y ↔ U: centroid statistics depend on query statistics [Willems(2009)]
◦ U ↔ X ↔ Y: centroid statistics depend on database entries [Jégou et al.(2011)]
◦ General: centroid statistics depend on both
Question: What is the optimal scheme?
Idea: Search-memory complexity analysis.
Search-Memory Complexity
Memory Complexity Exponent [Farhadzadeh et al.(2013b)]
[Figure: database organized into clusters; each centroid $u^N(w_1)$, $w_1 = 1, \ldots, M_1$, is followed by its members $x^N(w_1, 1), x^N(w_1, 2), \ldots, x^N(w_1, M_2)$.]
– Clusters: $2^{N I(X,Y;U)}$ centroids $u^N(w_1)$
– Cluster members:
◦ $R > I(U; X)$: $2^{N[R - I(U;X)]}$ entries $x^N(w)$ in each cluster,
◦ $R < I(U; X)$: at most a single $x^N(w)$ in each cluster.
$$M_e = \max\Big\{\underbrace{I(U; X, Y)}_{\text{\# of clusters}} + \underbrace{R - I(U; X)}_{\text{\# of items in each cluster}},\; I(U; X, Y)\Big\}$$
Search Complexity Exponent [Farhadzadeh et al.(2013b)]
[Figure: the query $y^N$ is compared with the centroids $u^N(1), \ldots, u^N(M_1)$, each followed by its members $x^N(w_1, 1), \ldots, x^N(w_1, M_2)$.]
– First decoder: $2^{N I(U;X,Y)}$ cluster checks to construct $W_1$
– Second decoder:
◦ $R > I(U; X)$: $2^{N[R + I(U;X|Y) - I(U;X)]}$ refinement checks,
◦ $R < I(U; X)$: $2^{N I(U;X|Y)}$ refinement checks.
$$S_e = \max\Big\{\underbrace{I(U; X, Y)}_{\text{\# of clusters}},\; \max\Big\{\underbrace{R - I(U; X)}_{\text{\# of items in each cluster}} + \underbrace{I(U; X \mid Y)}_{\text{\# of clusters est. by first dec.}},\; I(U; X \mid Y)\Big\}\Big\}$$
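The two exponents are easy to evaluate once the mutual informations are known. The sketch below does so for one special case chosen for illustration: a uniform binary source, a BSC($q$) from $X$ to $Y$, and a Markov chain $U \leftrightarrow X \leftrightarrow Y$ in which $U$ is obtained from $X$ through a BSC($p$). These modeling choices, the parameter grid and all function names are assumptions, not part of the slides.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def conv(a, b):
    """Cross-over probability of two cascaded BSCs."""
    return a * (1.0 - b) + b * (1.0 - a)


def exponents_u_x_y(p, q, rate):
    """(M_e, S_e) for U <-> X <-> Y with X->U a BSC(p) and X->Y a BSC(q)."""
    i_ux = 1.0 - h2(p)
    i_uy = 1.0 - h2(conv(p, q))
    i_uxy = i_ux                 # I(U;X,Y) = I(U;X) under the Markov chain
    i_ux_given_y = i_uxy - i_uy  # chain rule: I(U;X,Y) = I(U;Y) + I(U;X|Y)
    m_e = max(i_uxy + rate - i_ux, i_uxy)
    s_e = max(i_uxy, max(rate - i_ux + i_ux_given_y, i_ux_given_y))
    return m_e, s_e


q, rate = 0.1, 0.5
grid = [k / 1000.0 for k in range(1, 501)]
best_p = min(grid, key=lambda p: exponents_u_x_y(p, q, rate)[1])
best_m_e, best_s_e = exponents_u_x_y(best_p, q, rate)
print(f"p = {best_p:.3f}: M_e = {best_m_e:.3f}, S_e = {best_s_e:.3f}")
```

With $p = 0.5$ the centroids carry no information and both exponents equal $R$, which corresponds to exhaustive search over the whole database.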
Binary Source
Example
Consider $Q_s(x) = 1/2$, $x \in \{0, 1\}$, a BSC with cross-over probability $q = 0.1$, $R = 0.5$ and $\mathcal{U} = \{0, 1\}$.
[Figure: $M_e$ (vertical axis, 0.48 to 0.62) versus $S_e$ (horizontal axis, 0.25 to 0.6) for the schemes U ↔ X ↔ Y and X ↔ Y ↔ U; the point of minimum $S_e$ is marked with the question "Minimum $S_e$: how?".]
Remark
◦ The generalized scheme achieves smaller search complexity.
◦ The minimum search-complexity exponent is larger than $R/2$.
Minimizing Search Complexity [Farhadzadeh et al.(2014)]
Theorem
Let $\mathcal{U} = \{0, 1\}$. The minimum search-complexity exponent
$$S_e^* = (1 - q)\,(1 - H_2(p_1^*))$$
can be achieved if $P(y \mid u)$ and $P(x \mid u)$ are BSCs with the same cross-over probability $P_b = p_1^*(1 - q) + q/2$, where
$$p_1^* = \frac{H_2^{-1}(1 - R/2) - q/2}{1 - q}.$$
Conclusion
$S_e^*$ is achieved if $I(U; X) = I(U; Y)$, which implies
$$\underbrace{I(U; X \mid Y)}_{\text{\# of clus. est. by first dec.}} = \underbrace{I(U; Y \mid X)}_{\text{\# of clus. incl.}}$$
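The closed-form expression of the theorem can be evaluated numerically. The sketch below inverts the binary entropy by bisection on $[0, 1/2]$ and uses the example values $q = 0.1$, $R = 0.5$; it assumes entropies in bits and reads the common cross-over probability as $P_b = H_2^{-1}(1 - R/2)$, which follows from inverting the expression for $p_1^*$. Function names are hypothetical.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def h2_inverse(h, tol=1e-12):
    """Inverse of h2 restricted to [0, 1/2], by bisection (h2 is increasing there)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def minimum_search_exponent(q, rate):
    """Return (S_e*, p_1*, P_b) for a uniform binary source and a BSC(q)."""
    p_b = h2_inverse(1.0 - rate / 2.0)
    p_1 = (p_b - q / 2.0) / (1.0 - q)
    s_e = (1.0 - q) * (1.0 - h2(p_1))
    return s_e, p_1, p_b


s_e_star, p_1_star, p_b = minimum_search_exponent(q=0.1, rate=0.5)
print(f"p_1* = {p_1_star:.4f}, P_b = {p_b:.4f}, S_e* = {s_e_star:.4f}")
```

With this choice $I(U;X) = I(U;Y) = 1 - H_2(P_b) = R/2$, and the resulting exponent stays above $R/2$, in line with the remark on the binary example.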
Numerical Evaluation
Fingerprint extraction
◦ We use an image database of 20,828 grayscale images from ImageNet.
◦ All images are resized to 384 × 512 pixels.
◦ A binary fingerprint of length 64 is extracted from each image, $R \approx 0.22$.
Identification performance, memory and complexity analysis
Clustering: k-medians with $M_1 = 180$, BBMM with $M_1 = 220$.
| Distortion model | Parameter | $P_E$ (%) | k-medians $M_2$ (≈) | k-medians $M_3$ | BBMM $M_2$ (≈) | BBMM $M_3$ | k-medians $S_e$ | k-medians usage (%) | BBMM $S_e$ | BBMM usage (%) | k-medians $M_e$ | BBMM $M_e$ |
| AWGN | PSNR = 40 dB | 0.019 | 115 | 30 | 473 | 5 | 0.19 | 16.73 | 0.17 | 10.68 | 0.22 | 0.26 |
| AWGN | PSNR = 30 dB | 0.024 | 115 | 70 | 568 | 6 | 0.21 | 38.85 | 0.18 | 14.95 | 0.22 | 0.26 |
| AWGN | PSNR = 20 dB | 0.389 | 115 | 125 | 946 | 10 | 0.22 | 69.33 | 0.2 | 35.14 | 0.22 | 0.27 |
| JPEG | QF = 75 | 0.010 | 115 | 27 | 378 | 4 | 0.18 | 15.06 | 0.16 | 8.67 | 0.22 | 0.25 |
| JPEG | QF = 50 | 0.014 | 115 | 30 | 568 | 5 | 0.19 | 16.73 | 0.17 | 12.64 | 0.22 | 0.26 |
| JPEG | QF = 25 | 0.016 | 115 | 50 | 662 | 7 | 0.20 | 27.79 | 0.19 | 19.62 | 0.22 | 0.27 |
| Histeq | | 3.140 | 115 | 140 | 1236 | 13 | 0.22 | 87.32 | 0.21 | 50.91 | 0.22 | 0.28 |
Active Content Fingerprinting
Conventional Passive Content Fingerprinting
Main Challenge
Complexity
Robust fingerprint
Binary fingerprint: Bounded Distance Decoder (BDD)
[Figure: BSC between $\bar{X}^L$ and $\bar{Y}^L$ with cross-over probability $P_b$; the BDD searches a Hamming sphere of radius $\theta L$ around $\bar{y}^L$, with $\theta \propto P_b$.]
Relatively large $P_b$ ⇒ large Hamming sphere ⇒ high complexity;
however,
robust fingerprint (small $P_b$) ⇒ small Hamming sphere ⇒ low complexity.
One solution
Active Content Fingerprint: modify the content to make its fingerprint (FP) more robust [Voloshynovskiy et al.(2012)].
Unidimensional Case of aCFP
Shrinkage-Based aCFP (SbaCFP) [Farhadzadeh et al.(2013a)]
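To make FP more robust
[Figure, three panels. Original distribution $p(\tilde{x})$: coefficients with $-\gamma < \tilde{x} < +\gamma$ are less robust to bit flips. Modulator function $\varphi_s(\tilde{x})$: coefficients inside $(-\gamma, +\gamma)$ are moved to $\pm\gamma$. Modulated distribution $p(\varphi_s(\tilde{x}))$: probability masses $\frac{1}{2} - Q\big(\frac{\gamma}{\sigma_X}\big)$ at $-\gamma$ and at $+\gamma$, and no mass around 0 where bit flips are most likely.]
$$P_b = \Pr\{\mathrm{sign}(\tilde{X}_i) \ne \mathrm{sign}(\tilde{Y}_i)\} = E\Big[Q\Big(\frac{|\varphi_s(\tilde{X})|}{\sigma_Z}\Big)\Big]$$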
Shrinkage-Based aCFP [Farhadzadeh et al.(2013a)]
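Implementation:
[Figure: $X^N$ is transformed by $W_F$ into $\tilde{X}^N$; the divider splits $\tilde{X}^N$ into $\tilde{X}^L$, which passes through the modulator $\varphi_s(\cdot)$ (keyed by $k$), and the complementary part $\tilde{X}_c^{N-L}$; the combiner followed by $W_F^{-1}$ produces the modified content $V^N$. The fingerprint $\bar{X}^L$ is obtained by taking the sign of the modulated coefficients.]
$$W_F \in \{\pm 1/\sqrt{N}\}^{N \times N}, \qquad W \subset W_F$$
$$D_s = \frac{1}{N} E\big[\|X^N - V^N\|_2^2\big] = 2\,\frac{L}{N}\int_0^{\gamma}(\gamma - t)^2\, p_{\tilde{X}}(t)\,dt$$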
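The robustness gain and the embedding distortion of the shrinkage modulator can be estimated with a short simulation. The sketch assumes the modulator acts as $\varphi_s(\tilde{x}) = \mathrm{sign}(\tilde{x})\max(|\tilde{x}|, \gamma)$, which is an interpretation of the modulator plot that agrees with the distortion integral above; the Gaussian model for $\tilde{X}$, the parameter values and the function names are illustrative assumptions.

```python
import math
import random


def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def shrink_modulate(x, gamma):
    """Move |x| < gamma out to +-gamma (sign kept); leave larger values unchanged."""
    if abs(x) >= gamma:
        return x
    return gamma if x >= 0.0 else -gamma


def evaluate(gamma, sigma_x=1.0, sigma_z=0.5, samples=20000, seed=1):
    """Monte Carlo estimates of P_b = E[Q(|phi(X)|/sigma_Z)] and E[(X - phi(X))^2]."""
    rng = random.Random(seed)
    p_b = 0.0
    distortion = 0.0
    for _ in range(samples):
        x = rng.gauss(0.0, sigma_x)
        v = shrink_modulate(x, gamma)
        p_b += q_func(abs(v) / sigma_z)
        distortion += (x - v) ** 2
    return p_b / samples, distortion / samples


def distortion_integral(gamma, sigma_x=1.0, steps=10000):
    """Midpoint rule for 2 * int_0^gamma (gamma - t)^2 p(t) dt with Gaussian p."""
    dt = gamma / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        pdf = math.exp(-t * t / (2.0 * sigma_x ** 2)) / (sigma_x * math.sqrt(2.0 * math.pi))
        total += (gamma - t) ** 2 * pdf * dt
    return 2.0 * total


passive_pb, passive_dist = evaluate(gamma=0.0)  # gamma = 0 leaves the content untouched
active_pb, active_dist = evaluate(gamma=1.0)
print(f"passive P_b = {passive_pb:.4f}, active P_b = {active_pb:.4f}")
print(f"per-coefficient distortion: {active_dist:.4f} vs integral {distortion_integral(1.0):.4f}")
```

The distortion reported here is per modulated coefficient; multiplying by $L/N$ gives $D_s$, since only $L$ of the $N$ transform coefficients are modified.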
Analytical comparison
[Figure: $P_b$ (vertical axis, logarithmic, $10^{-30}$ to $10^{0}$) versus DNR (horizontal axis, 0 to 25) for pCFP, LB and SbaCFP.]
Comparison of pCFP, SbaCFP and LB in DWM, DWR = 24 dB and L/N = 0.01.
Advantages:
◦ more robust
◦ lower complexity
Disadvantages:
◦ quality degradation
◦ still random structure: BDD
Multidimensional Case of aCFP
Lattice-Based aCFP (LbaCFP) [Farhadzadeh et al.(2013a)]
From random to structural fingerprint
$$X^N \in \mathbb{R}^N \;\xrightarrow{\;W\;}\; \tilde{X}^L \in \mathbb{R}^L$$
Remark
A lattice quantizer is used to modify the contents so as to impose structure on $\tilde{X}^L$.
Use the Leech lattice in $\mathbb{R}^{24}$ [Leech(1967)]:
◦ Largest kissing number
◦ Largest packing density
◦ Very fast decoder: about 519 operations [Amrani and Beery(1996)]
Lattice-Based aCFP [Farhadzadeh et al.(2013a)]
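Implementation:
[Figure: $X^N$ is transformed by $W_F$ into $\tilde{X}^N$; the divider splits $\tilde{X}^N$ into $\tilde{X}^L$, which passes through the lattice modulator $\varphi_\Lambda(\cdot)$ (keyed by $k$), and the complementary part $\tilde{X}_c^{N-L}$; the combiner followed by $W_F^{-1}$ produces the modified content $V^N$. The modulator also outputs the fingerprint $\bar{x}^L$.]
$$W_F \in \{\pm 1/\sqrt{N}\}^{N \times N}, \qquad W \subset W_F$$
$$D_\Lambda = \frac{1}{N} E\big[\|X^N - V^N\|_2^2\big] = \frac{L}{N}\, G(\Lambda, \mathcal{V})\, V^{2/L},$$
where $G(\Lambda, \mathcal{V})$ is the normalized second moment and $V$ is the volume of the Voronoi region $\mathcal{V}$.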
Advantages:
◦ more robust
◦ very low complexity
Disadvantages:
◦ quality degradation
◦ non-uniform distribution
Numerical Evaluation
Fingerprint extraction
◦ We employ a real image database of 1,338 grayscale images from UCID [Schaefer and Stich(2004)].
◦ A binary fingerprint is extracted from each modulated image.
Different modulation schemes
| Modulator | Parameters | PSDR | MSSIM |
| SbaCFP | L = 192, γ = 60 | 53 dB | 0.999 |
| SbaCFP | L = 32, γ = 110 | 53 dB | 0.999 |
| LbaCFP | scale = 70, L = 24 | 53 dB | 0.999 |
Performance Analysis
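Unidimensional case
Modulator: SbaCFP with L = 32, θ = 2/L, γ = 110.
| Distortion | Image domain | Projection domain | $P_{ci}$ | $P_{fa}$ | $\hat{P}_b$ | $P_b$ | $P_b$ (pCFP) | $P_{ci}$ (pCFP) |
| AWGN | PSNR = 20 dB | DNR = 18 dB | 1 | 0 | 0 | 0 | 0.05 | 0.80 |
| AWGN | PSNR = 15 dB | DNR = 13 dB | 0.998 | 0 | 0.004 | 0.003 | 0.08 | 0.55 |
| AWGN | PSNR = 10 dB | DNR = 8 dB | 0.76 | 0 | 0.05 | 0.05 | 0.13 | 0.21 |
| AWGN | PSNR = 5 dB | DNR = 3 dB | 0.15 | 0 | 0.15 | 0.14 | 0.21 | 0.05 |
| JPEG | QF = 25 | DNR = 27 dB | 1 | 0 | 0 | 0 | 0.01 | 0.98 |
| JPEG | QF = 10 | DNR = 20 dB | 1 | 0 | 0 | 0 | 0.04 | 0.86 |
| JPEG | QF = 1 | DNR = 10 dB | 0.92 | 0 | 0.03 | 0.02 | 0.11 | 0.33 |
| Histeq | | DNR = 4 dB | 0.94 | 0 | 0.02 | 0.12 | 0.14 | 0.43 |
Remark
The bit error probabilities $P_b$ of active content fingerprinting are markedly lower than those of pCFP, which leads to a high $P_{ci}$.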
Multidimensional Case
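| Distortion | Image domain | Projection domain | $P_{ci}$, SbaCFP (L = 192, θ = 2/L) | $P_{ci}$, LbaCFP (L = 24) |
| AWGN | PSNR = 20 dB | DNR = 18 dB | 0.96 | 1 |
| AWGN | PSNR = 15 dB | DNR = 13 dB | 0.07 | 0.83 |
| AWGN | PSNR = 10 dB | DNR = 8 dB | 0 | 0.01 |
| AWGN | PSNR = 5 dB | DNR = 3 dB | 0 | 0 |
| JPEG | QF = 25 | DNR = 27 dB | 1 | 1 |
| JPEG | QF = 10 | DNR = 19 dB | 0.998 | 1 |
| JPEG | QF = 1 | DNR = 10 dB | 0 | 0.14 |
| Histeq | | DNR = 6 dB (LbaCFP: 3 dB) | 0.3 | 0.11 |
Remark
Except for Histeq, LbaCFP outperforms the unidimensional modulation scheme.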
Conclusions and Future Work
Conclusions
◦ We introduced an identification setup based on the constrained list-based decoder and analyzed its performance.
◦ We analyzed a simple digital fingerprinting approach based on random projections. Random projections not only reduce the data dimensionality but also decorrelate the data samples.
◦ We investigated a two-stage decoding scheme capable of achieving the identification capacity with a search complexity lower than that of exhaustive search.
◦ We presented active content fingerprinting, which takes the best of content fingerprinting and digital watermarking to overcome some of the fundamental restrictions of these techniques.
Future Work
◦ Identification based on a unique representative → multiple representatives
◦ First-order autoregressive process → higher-order autoregressive process
◦ Two-stage decoding in identification → information retrieval
◦ Identification rate, search and memory complexity trade-offs → including security and privacy leakage trade-offs
References
Amrani, O., Beery, Y., 1996, Efficient bounded-distance decoding of the hexacode and associated decoders for the Leech lattice and the Golay code, IEEE Trans. on Com., 44, 534–537
Farhadzadeh, F., Voloshynovskiy, S., Koval, O., 2010, Performance analysis of identification system based on order statistics list decoder, in Proc. of IEEE International Symposium on Information Theory (ISIT), Austin, TX
Farhadzadeh, F., Voloshynovskiy, S., Koval, O., Beekhof, F., 2011, Information-theoretic analysis of content based identification for correlated data, in IEEE Information Theory Workshop (ITW), pp. 205–209, Paraty, Brazil
Farhadzadeh, F., Voloshynovskiy, S., Holotyak, T., Beekhof, F., 2013a, Active content fingerprinting: Shrinkage and lattice based modulations, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada
Farhadzadeh, F., Willems, F. M., Voloshynovskiy, S., 2013b, Fundamental limits of identification: Identification rate, search and memory complexity trade-off, in IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey
Farhadzadeh, F., Sun, K., Ferdowsi, S., 2014, Efficient two stage decoding scheme to achieve content identification capacity, submitted
Jégou, H., Douze, M., Schmid, C., 2011, Product quantization for nearest neighbor search, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33, 117–128
Leech, J., 1967, Notes on sphere packings, Canadian Journal of Mathematics
Schaefer, G., Stich, M., 2004, UCID - an uncompressed colour image database, in Storage and Retrieval Methods and Applications for Multimedia, Proc. of SPIE
Voloshynovskiy, S., Farhadzadeh, F., Koval, O., Holotyak, T., 2012, Active content fingerprinting: a marriage of digital watermarking and content fingerprinting, in IEEE WIFS, Tenerife, Spain
Willems, F., 2009, Searching methods for biometric identification systems: Fundamental limits, in IEEE International Symposium on Information Theory (ISIT), pp. 2241–2245
Willems, F., Kalker, T., Goseling, J., Linnartz, J., 2003, On the capacity of a biometrical identification system, in Proc. of IEEE International Symposium on Information Theory, p. 82