Information-Theoretic Analysis of Identification Systems in Large-Scale Databases
Farzad Farhadzadeh
Computer Science Department, University of Geneva, Switzerland
January 15, 2014
Outline
Introduction
Fingerprint Statistics and Constrained List-Based Decoder
Identification Rate, Search and Memory Complexity Trade-off
Active Content Fingerprinting
Conclusions and Future Work
Introduction
Motivation
Physical objects: jewelry, packaging
Biometrics: human
Digital contents
Main concerns
◦ High dimensional data
◦ Highly correlated data
◦ Performance
◦ Search complexity
◦ Memory complexity
Identification Setup
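[Block diagram: the source emits an index $W$; the selector picks the entry $X^N(W)$ from the database $X^N(1), \dots, X^N(M)$; the observation channel outputs $Y^N$; the decoder, which has access to the database, outputs the estimate $\hat{W}$.]
An identification rate $R$ is called achievable if, for any $\delta > 0$, there exist, for large enough $N$, decoders such that
$\frac{1}{N} \log_2 M \ge R - \delta, \qquad P_E \le \delta.$
Error probability:
$P_E \triangleq \frac{1}{M} \sum_{w=1}^{M} \Pr\{\hat{W} \ne w \mid W = w\}$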
Theorem
The capacity of an identification system, $C_{id}$, defined as the supremum of all achievable rates, is given by [Willems et al.(2003)]
$C_{id} = I(X; Y),$
where $P(x, y) = Q_s(x) Q_c(y \mid x)$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$.
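As a small illustration of the capacity expression, the sketch below evaluates $I(X; Y)$ from a joint distribution. The uniform binary source and the BSC with cross-over probability 0.1 are taken from the binary example later in the talk; the function names are illustrative only.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint pmf given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def bsc_joint(q):
    """Joint pmf Qs(x) Qc(y|x) for a uniform binary source and a BSC(q)."""
    return {(x, y): 0.5 * (q if x != y else 1 - q)
            for x in (0, 1) for y in (0, 1)}

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Identification capacity of the binary example: equals 1 - H2(q).
capacity = mutual_information(bsc_joint(0.1))
```

For this symmetric case the result coincides with the closed form $1 - H_2(q)$.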
Main Contributions
Content identification: passive and active
◦ Digital fingerprint: to address search–memory complexity issues
◦ List decoder: to improve the identification performance
◦ $R_{id}$, $S_e$, $M_e$ trade-off: identification rate, search and memory complexity trade-off
◦ aCFP: marriage of passive fingerprinting and watermarking
Fingerprint Statistics and Constrained List-Based Decoder
Identification Setup
Content Identification Based on Binary Fingerprints
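[Block diagram. Enrollment: the fingerprints $\bar{X}^L(1), \dots, \bar{X}^L(M)$ are extracted by $\psi(\cdot)$ from the contents $X^N(1), \dots, X^N(M)$ and stored in the database. Identification: the content $X^N(W)$ passes through the acquisition channel $P(Y^N \mid X^N)$, fingerprint extraction $\psi(\cdot)$ produces the query fingerprint $\bar{Y}^L$ from $Y^N$, and the decoder outputs a list of candidates $N_l$.]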
Definition
Digital fingerprint: robust, short and discriminative content representation
Statistical Analysis of Digital Fingerprint
Digital Fingerprint from Correlated Data
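Source model:
$X^N \sim \mathcal{N}(0^N, K_{xx})$, with $X_n = \rho X_{n-1} + \Xi_n$ and $\Xi_n \sim \mathcal{N}(0, (1 - \rho^2)\sigma_X^2)$
Fingerprint extraction $\psi(\cdot)$:
◦ Random projection with a key-dependent matrix (key $k$): $\tilde{X}^L = W^\dagger X^N$, $W \in \{\pm 1/\sqrt{N}\}^{N \times L}$, $W_{ij} \sim$ Bernoulli(0.5)
◦ Projected data: $\tilde{X}^L \sim \mathcal{N}(0^L, K_{\tilde{x}\tilde{x}})$
◦ Binarization: $\bar{X}^L = \mathrm{sign}(\tilde{X}^L) \in \{0, 1\}^L$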
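The extraction chain above (AR(1) source, random $\pm 1/\sqrt{N}$ projection, sign binarization) can be sketched as follows. This is a minimal illustration rather than the author's implementation: the function names, the use of Python's `random` module as the key-driven generator, and the sizes in the demo are assumptions.

```python
import math
import random

def ar1_signal(n, rho, sigma_x, rng):
    """First-order autoregressive Gaussian sequence with variance sigma_x^2."""
    x = [rng.gauss(0.0, sigma_x)]
    innovation_std = math.sqrt(1 - rho ** 2) * sigma_x
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0.0, innovation_std))
    return x

def projection_matrix(n, l, key):
    """N x L matrix with equiprobable entries +-1/sqrt(N), seeded by the key."""
    rng = random.Random(key)
    s = 1 / math.sqrt(n)
    return [[s if rng.random() < 0.5 else -s for _ in range(l)] for _ in range(n)]

def fingerprint(x, w):
    """Binary fingerprint: project x onto the columns of w, then take signs."""
    l = len(w[0])
    projected = [sum(w[i][j] * x[i] for i in range(len(x))) for j in range(l)]
    return [1 if v >= 0 else 0 for v in projected]

# Demo with small, arbitrary sizes (not the N = 768, L = 32 of the experiments).
rng = random.Random(1)
x = ar1_signal(256, rho=0.41, sigma_x=1.0, rng=rng)
w = projection_matrix(256, 16, key=42)
fp = fingerprint(x, w)
```

Because every column of $W$ has unit norm, the projected coefficients keep the variance scale of the source, which is what the later bit-error analysis relies on.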
Digital Fingerprint from Correlated Data [Farhadzadeh et al.(2011)]
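Proposition
Off-diagonal and diagonal elements of $K_{\tilde{x}\tilde{x}}$ can be bounded as follows:
$\Pr\{\max_{i \ne j} |K_{\tilde{x}\tilde{x}}^{ij}| > \beta\sigma_X^2\} < \frac{1}{L}$ (off-diagonal elements)
$\Pr\{\max_{i} |K_{\tilde{x}\tilde{x}}^{ii} - \sigma_X^2| > \alpha\sigma_X^2\} < \frac{2}{L} \cdot \frac{1}{\rho}$ (diagonal elements)
where $\beta = \frac{1 - \rho^N}{1 - \rho} \cdot \frac{12}{N} \ln L$ and $\alpha = \frac{1 - \rho^{N-1}}{1 - \rho} \cdot \frac{8}{N} \rho \ln L$.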
Remark
For sufficiently large $N$ and $L$ with $L \le N$: $\beta \to 0$ and $\alpha \to 0$, so $K_{\tilde{x}\tilde{x}}$ converges to $\sigma_X^2 I_L$ with high probability.
Conclusion
Gaussian: uncorrelated ⇒ independent ⇒ $\bar{X}^L \sim$ i.i.d. Bernoulli($\frac{1}{2}$).
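Statistics of Query Fingerprint
Observation model: $Y^N = X^N + Z^N$, with $X^N \in \mathbb{R}^N$ and $Z^N \sim \mathcal{N}(0^N, \sigma_Z^2 I_N)$
Fingerprint extraction $\psi(\cdot)$ with the same key-dependent $W$ (key $k$): $\tilde{Y}^L \in \mathbb{R}^L$,
$\bar{Y}^L = \mathrm{sign}(\tilde{Y}^L) = \mathrm{sign}(W^\dagger Y^N) = \mathrm{sign}(W^\dagger X^N + W^\dagger Z^N)$
Conclusion
Gaussian: uncorrelated ⇒ independent ⇒ $\bar{Y}^L \sim$ i.i.d. Bernoulli($\frac{1}{2}$).
Additive White Gaussian Noise (AWGN) between $X^N$ and $Y^N$ ⇒ Binary Symmetric Channel (BSC) between $\bar{X}^L$ and $\bar{Y}^L$ with cross-over probability
$P_b = \frac{1}{\pi} \arctan\frac{\sigma_Z}{\sigma_X}$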
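The cross-over probability of the equivalent BSC can be checked numerically. The sketch below compares the closed form with a Monte Carlo estimate for a single projected coefficient, assuming (as in the analysis above) that the projected signal and noise are independent zero-mean Gaussians with standard deviations $\sigma_X$ and $\sigma_Z$; the function names are illustrative.

```python
import math
import random

def bit_error_probability(sigma_x, sigma_z):
    """Closed-form BSC cross-over probability P_b = arctan(sigma_z / sigma_x) / pi."""
    return math.atan(sigma_z / sigma_x) / math.pi

def simulate_bit_error(sigma_x, sigma_z, trials, seed=0):
    """Fraction of trials in which additive Gaussian noise flips the sign bit."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        x = rng.gauss(0.0, sigma_x)
        y = x + rng.gauss(0.0, sigma_z)
        flips += (x >= 0) != (y >= 0)
    return flips / trials

analytic = bit_error_probability(1.0, 1.0)
empirical = simulate_bit_error(1.0, 1.0, trials=20000)
```

With equal signal and noise power the closed form gives exactly one quarter, and the simulated rate should land close to it.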
Performance Analysis
Constrained List-Based Decoder [Farhadzadeh et al.(2010)]
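Binary case
[Diagram: the database entries are ranked by their Hamming distance to the query, $d_H(\bar{y}^L, \bar{x}^L(r_1)), \dots, d_H(\bar{y}^L, \bar{x}^L(r_{j-1})), d_H(\bar{y}^L, \bar{x}^L(r_j)), d_H(\bar{y}^L, \bar{x}^L(r_{j+1})), \dots, d_H(\bar{y}^L, \bar{x}^L(r_{N_l})), \dots, d_H(\bar{y}^L, \bar{x}^L(r_M))$; the first $N_l$ entries, subject to the constraint $\le \eta L$, form the list $N_l$.]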
Probability of miss
$P_m = 1 - P_{ci} = \frac{1}{M} \sum_{w=1}^{M} \Pr\{(w \notin N_l) \cup (D_w > \eta L) \mid H_w\}$
$P_{ci}$: probability of correct identification.
Probability of false acceptance
$P_{fa} = \Pr\{N_l \ne \emptyset \mid H_0\}$.
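One way to read the constrained list-based decoder is sketched below: rank the database by Hamming distance to the query, keep at most a fixed number of closest entries, and discard any whose distance exceeds $\eta L$. This is an interpretation of the slide, not the author's code; `list_decode` and its parameters are hypothetical names.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(u != v for u, v in zip(a, b))

def list_decode(query, database, list_size, eta):
    """Return indices of up to list_size closest entries within distance eta * L."""
    threshold = eta * len(query)
    ranked = sorted(range(len(database)),
                    key=lambda w: hamming(query, database[w]))
    return [w for w in ranked[:list_size]
            if hamming(query, database[w]) <= threshold]

# Toy database of three 4-bit fingerprints.
database = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1]]
query = [0, 0, 0, 1]
candidates = list_decode(query, database, list_size=2, eta=0.25)
rejected = list_decode(query, database, list_size=2, eta=0.1)
```

An empty list corresponds to rejecting the query, so a false acceptance is a non-empty list for a query unrelated to the database, and a miss is the true entry being absent from the list.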
Probability of False Acceptance [Farhadzadeh et al.(2011)]
$P_{fa} = \Pr\{\bigcup_{w=1}^{M} \{D_w \le \eta L\} \mid H_0\}$
Proposition
For a binary database with $M = e^{LR}$ entries of length $L$, the probability of false acceptance of the constrained list-based decoder, $P_{fa}$, for any $P_b < \eta < \frac{1}{2}$, satisfies
$P_{fa} \le \exp[-L(\ln 2 - H_2(\eta) - R)],$
where $H_2(\eta) = -\eta \log \eta - (1 - \eta) \log(1 - \eta)$.
Remarks
$P_{fa}$ is the same as for the unique decoder.
If $R < \ln 2 - H_2(\eta)$, then $L \to \infty$ implies $P_{fa} \to 0$.
Probability of Miss [Farhadzadeh et al.(2011)]
$P_m = \Pr\{(1 \notin N_l) \cup (D_1 > \eta L) \mid H_1\} = \underbrace{\Pr\{(1 \notin N_l) \cap (D_1 \le \eta L) \mid H_1\}}_{P_m^{I}} + \underbrace{\Pr\{D_1 > \eta L \mid H_1\}}_{P_m^{II}}$
Proposition
The probability of miss of the constrained list-based decoder, $P_m$, for any $P_b < \eta < \frac{1}{2}$, satisfies
$P_m \le \{\exp[-L(\ln 2 - H_2(\eta) - R)]\}^{N_l} + \exp[-L D(\eta \| P_b)],$
where $D(\eta \| P_b) = \eta \log \frac{\eta}{P_b} + (1 - \eta) \log \frac{1 - \eta}{1 - P_b}$.
Remarks
For $N_l > 1$ the first kind of miss probability, $P_m^{I}$, decays faster.
If $R < \ln 2 - H_2(\eta)$, then $L \to \infty$ implies $P_m \to 0$.
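The two bounds are easy to evaluate numerically. The sketch below assumes natural logarithms throughout (consistent with $M = e^{LR}$ and the $\ln 2$ term) and reads the list size $N_l$ as an exponent on the first term of the miss bound; the parameter values in the demo are arbitrary.

```python
import math

def h2_nats(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def kl_nats(eta, pb):
    """Binary divergence D(eta || pb) in nats."""
    return eta * math.log(eta / pb) + (1 - eta) * math.log((1 - eta) / (1 - pb))

def pfa_bound(length, rate, eta):
    """Upper bound on the false acceptance probability."""
    return math.exp(-length * (math.log(2) - h2_nats(eta) - rate))

def pm_bound(length, rate, eta, pb, list_size):
    """Upper bound on the miss probability for a given list size."""
    return pfa_bound(length, rate, eta) ** list_size + math.exp(-length * kl_nats(eta, pb))

# Arbitrary demo values: L = 64, R = 0.1 nats, eta = 0.2, P_b = 0.1.
single = pm_bound(64, 0.1, 0.2, 0.1, list_size=1)
double = pm_bound(64, 0.1, 0.2, 0.1, list_size=2)
```

Increasing the list size only shrinks the first term, which matches the remark that $P_m^{I}$ decays faster for $N_l > 1$ while $P_m^{II}$ is unaffected.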
Numerical Evaluation
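Feature Extraction
[Diagram: each image is split into blocks of 16 × 16 pixels; a DCT is applied to every block, and the DCT coefficient (1, 2) of each block is collected into the feature vector $X^N$.]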
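Data Statistics
Feature domain (N = 768): $\rho$ = 0.41
RP domain (L = 32): $\max_{i \ne j} K_{\tilde{x}\tilde{x}}^{ij}$ = 0.08, $\theta_{\tilde{X}}$ = 1.74
Binary domain (L = 32): $\max_{i \ne j} K_{\bar{x}\bar{x}}^{ij}$ = 0.07, $\hat{P}$ = 0.5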
Remark
Projected data are approximately uncorrelated and follow Gaussian distribution.
Data Statistics
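Columns: feature domain ($\max_{i \ne j} K_{zz}^{ij}$, $\theta_Z$, $\max K_{zx}$); RP domain ($\max_{i \ne j} K_{\tilde{z}\tilde{z}}^{ij}$, $\theta_{\tilde{Z}}$, $\max K_{\tilde{z}\tilde{x}}$); binary domain ($\max_{i \ne j} K_{\bar{z}\bar{z}}^{ij}$, $P_b$, $\hat{P}_b$, $\max K_{\bar{z}\bar{x}}$).
Distortion model, parameter | Feature domain   | RP domain        | Binary domain
AWGN, PSNR = 5 dB           | 0.03  2     0.03 | 0.08  2     0.09 | 0.08  0.20  0.21  0.09
AWGN, PSNR = 10 dB          | 0.03  2     0.03 | 0.07  2     0.07 | 0.07  0.12  0.13  0.08
AWGN, PSNR = 15 dB          | 0.03  2     0.03 | 0.08  2     0.08 | 0.08  0.07  0.08  0.09
AWGN, PSNR = 20 dB          | 0.03  2     0.03 | 0.08  2     0.09 | 0.08  0.04  0.05  0.09
JPEG, QF = 1                | 0.04  1.2   0.09 | 0.10  1.93  0.10 | 0.10  0.10  0.11  0.09
JPEG, QF = 10               | 0.04  1.8   0.05 | 0.07  1.96  0.08 | 0.08  0.03  0.04  0.09
JPEG, QF = 25               | 0.03  1.95  0.06 | 0.06  1.99  0.09 | 0.07  0.01  0.01  0.09
Histeq                      | 0.15  0.75  0.49 | 0.20  1.19  0.30 | 0.12  0.14  0.1   0.09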
Remark
Noise approximately follows Gaussian distribution and is independent of
content.
Identification Performance
[Figure: $P_{ci}$ versus $P_{fa}$, both on a 0 to 1 scale; curves for PSNR = 5 with $N_l$ = 1, 2 and 4, and for PSNR = 20 with $N_l$ = 1.]
Remark
The list-based decoder improves the performance in a certain range of list sizes.
Identification Rate, Search and Memory Complexity Trade-off
Introduction
Search Complexity Reduction
Search complexity: the decoder has to check exhaustively all $x^N(w)$, $1 \le w \le M$, to find the best match.
Question: Can we speed up this process?
Idea: Do clustering.
To do clustering: assign the
database entries to clusters
◦ disjoint clusters
◦ overlapped clusters
To find the best match:
◦ single cluster estimation
◦ multiple cluster estimation
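[Diagram: the query $y^N$ is compared with the cluster centroids $u^N(1), u^N(2), \dots, u^N(w_1), \dots, u^N(M_1)$; each centroid covers a group of database entries among $x^N(1), \dots, x^N(M)$. In the example, the clusters of $u^N(1)$ and $u^N(2)$ are estimated and their members are checked.]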
Question: What is the fundamental trade-off between the number of
cluster-checks and refinement checks?
Generalized Model Description and Statement of result
Generalized Two-Stage Decoding [Farhadzadeh et al.(2013b)]
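[Block diagram: the source emits $W$; the selector picks $X^N(W)$ from the database $X^N(1), \dots, X^N(M)$; the observation channel outputs $Y^N$; the first decoder produces $W_1(1), \dots, W_1(M_3)$; the second decoder produces $W_1$ and $W_2$; the combiner outputs the index $W$.]
The first decoder, from $Y^N$, determines $W_1 = (W_1(1), \dots, W_1(M_3))$ and sends them to the second decoder.
The second decoder, from $Y^N$ and $W_1$, determines $W_1$ and $W_2$ and sends them to the combiner.
The combiner determines the index $W$.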
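A simplified sketch of the two-stage idea on binary fingerprints follows: the first stage shortlists the closest centroids (playing the role of the $M_3$ cluster estimates), and the second stage refines within those clusters. Hamming distance, the merging of the second decoder and combiner into one step, and all names are assumptions made for illustration; the information-theoretic decoders in the talk are defined differently.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(u != v for u, v in zip(a, b))

def two_stage_identify(query, centroids, clusters, database, num_clusters):
    """Return (best index, number of distance checks) for a two-stage search.

    clusters[c] lists the database indices assigned to centroid c; clusters
    may overlap. Assumes the shortlisted clusters are not all empty.
    """
    # Stage 1: one check per centroid, keep the num_clusters closest.
    ranked = sorted(range(len(centroids)),
                    key=lambda c: hamming(query, centroids[c]))
    shortlisted = ranked[:num_clusters]
    # Stage 2: refinement checks inside the shortlisted clusters only.
    candidates = [w for c in shortlisted for w in clusters[c]]
    best = min(candidates, key=lambda w: hamming(query, database[w]))
    return best, len(centroids) + len(candidates)

database = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]]
centroids = [[0, 0, 0, 0], [1, 1, 1, 1]]
clusters = [[0, 1], [2, 3]]
best, checks = two_stage_identify([1, 1, 1, 0], centroids, clusters,
                                  database, num_clusters=1)
```

The returned check count makes the trade-off visible: more shortlisted clusters lower the risk of missing the true entry but raise the number of refinement checks.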
Fundamental Trade-off [Farhadzadeh et al.(2013b)]
Theorem
The region of achievable rate quadruples of the identification system is given by
$\{(R_1, R_2, R_3, R) :$
$R_1 \ge I(X, Y; U),$
$R_2 \ge \max(0, R - I(X; U)),$
$R_3 \ge I(X; U \mid Y),$
$0 \le R \le C_{id} = I(X; Y),$
for $P(x, y, u) = Q_s(x) Q_c(y \mid x) P(u \mid x, y)$,
where $|\mathcal{U}| \le |\mathcal{Y}| \cdot |\mathcal{X}| + 2\}$.
Remark
We have P(x, y, u) = Qs (x)Qc (y | x)P(u | x, y). The auxiliary random variable
U depends on both X and Y .
Achievable Search-Clustering Schemes
Decoding \ Clustering | disjoint  | overlapped
single                | ×         | X ↔ Y ↔ U
multiple              | U ↔ X ↔ Y | General
◦ X ↔ Y ↔ U: centroid statistics depend on query statistics [Willems(2009)]
◦ U ↔ X ↔ Y: centroid statistics depend on database entries [Jégou et al.(2011)]
◦ General: centroid statistics depend on both
Question: What is the optimal scheme?
Idea: Search–Memory complexity analysis.
Search-Memory Complexity
Memory Complexity Exponent [Farhadzadeh et al.(2013b)]
[Diagram: centroids $u^N(1), u^N(2), \dots, u^N(M_1)$, each with its cluster members $x^N(w_1, 1), x^N(w_1, 2), \dots, x^N(w_1, M_2)$.]
– Clusters: $2^{N I(X, Y; U)}$ centroids $u^N(w_1)$
– Cluster members:
◦ $R > I(U; X)$: $2^{N[R - I(U; X)]}$ entries $x^N(w)$ in each cluster,
◦ $R < I(U; X)$: at most a single $x^N(w)$ in each cluster.
$M_e = \max\{\underbrace{I(U; X, Y)}_{\text{\# of clusters}} + \underbrace{R - I(U; X)}_{\text{\# of items in each cluster}},\ I(U; X, Y)\}$
Search Complexity Exponent [Farhadzadeh et al.(2013b)]
[Diagram: the query $y^N$ is compared with the centroids $u^N(1), u^N(2), \dots, u^N(M_1)$, each with its cluster members $x^N(w_1, 1), x^N(w_1, 2), \dots, x^N(w_1, M_2)$.]
– First decoder: $2^{N I(U; X, Y)}$ cluster checks to construct $W_1$
– Second decoder:
◦ $R > I(U; X)$: $2^{N[R + I(U; X \mid Y) - I(U; X)]}$ refinement checks,
◦ $R < I(U; X)$: $2^{N I(U; X \mid Y)}$ refinement checks.
$S_e = \max\{\underbrace{I(U; X, Y)}_{\text{\# of clusters}},\ \max\{\underbrace{R - I(U; X)}_{\text{\# of items in each cluster}} + \underbrace{I(U; X \mid Y)}_{\text{\# of clusters estimated by the first decoder}},\ I(U; X \mid Y)\}\}$
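Both exponents depend only on a few mutual informations of the joint distribution $P(x, y, u)$, so they can be evaluated directly. The sketch below does this for a generic joint pmf and, as an assumed test case, a uniform binary source with $Y$ and $U$ both obtained from $X$ through BSCs (the U ↔ X ↔ Y structure); the helper names are illustrative.

```python
import math
from itertools import product

def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axes):
    """Marginal of a joint pmf keyed by tuples, keeping the given positions."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def exponents(joint, rate):
    """Memory and search exponents (M_e, S_e) for a joint pmf keyed by (x, y, u)."""
    h = lambda axes: entropy(marginal(joint, axes))
    i_u_x = h((0,)) + h((2,)) - h((0, 2))
    i_u_y = h((1,)) + h((2,)) - h((1, 2))
    i_u_xy = h((0, 1)) + h((2,)) - h((0, 1, 2))
    i_u_x_given_y = i_u_xy - i_u_y  # chain rule
    memory = max(i_u_xy + rate - i_u_x, i_u_xy)
    search = max(i_u_xy, max(rate - i_u_x + i_u_x_given_y, i_u_x_given_y))
    return memory, search

def markov_joint(q, p):
    """Uniform X, Y = X through BSC(q), U = X through BSC(p) (U - X - Y)."""
    return {(x, y, u): 0.5 * (q if y != x else 1 - q) * (p if u != x else 1 - p)
            for x, y, u in product((0, 1), repeat=3)}

no_clustering = exponents(markov_joint(0.1, 0.5), rate=0.5)
full_copy = exponents(markov_joint(0.1, 0.0), rate=0.5)
```

The two extremes behave as expected: an uninformative $U$ reduces to exhaustive search with both exponents equal to $R$, while $U = X$ makes the centroids as numerous as all binary sequences.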
Binary Source
Example
Consider Qs (x) = 1/2, x ∈ {0, 1}, a BSC with cross-over probability q = 0.1,
R = 0.5 and U = {0, 1}.
[Figure: trade-off between $S_e$ and $M_e$ for the schemes U ↔ X ↔ Y and X ↔ Y ↔ U, with the annotation "Minimum $S_e$: how?".]
Remark
The generalized scheme achieves smaller search-complexity.
The minimum search–complexity exponent is larger than R/2.
Minimizing Search Complexity [Farhadzadeh et al.(2014)]
Theorem
Let $\mathcal{U} = \{0, 1\}$. The minimum search-complexity exponent
$S_e^* = (1 - q)(1 - H_2(p_1^*))$
can be achieved if $P(y \mid u)$ and $P(x \mid u)$ are BSCs with the same cross-over probability $P_b = p_1^* * q/2$, where $a * b = a(1 - b) + b(1 - a)$ and $p_1^* = [H_2^{-1}(1 - R/2) - q/2] / (1 - q)$.
Conclusion
$S_e^*$ is achieved if $I(U; X) = I(U; Y)$ ⇒ $\underbrace{I(U; X \mid Y)}_{\text{\# of clusters estimated by the first decoder}} = \underbrace{I(U; Y \mid X)}_{\text{\# of clus. incl.}}$
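The closed form can be evaluated for the binary example ($q = 0.1$, $R = 0.5$). The sketch below inverts the binary entropy by bisection on $[0, 1/2]$ and assumes that the common cross-over probability equals $H_2^{-1}(1 - R/2)$, which is what the expression for $p_1^*$ implies when it is solved for $p_1^*(1 - q) + q/2$; the function names are illustrative.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def h2_inverse(value, tol=1e-12):
    """Inverse of the binary entropy on [0, 1/2], found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h2(mid) < value:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def minimum_search_exponent(rate, q):
    """Return (S_e*, p_1*, P_b) for a uniform binary source and a BSC(q)."""
    pb = h2_inverse(1 - rate / 2)      # assumed common cross-over probability
    p1 = (pb - q / 2) / (1 - q)        # requires pb >= q / 2
    return (1 - q) * (1 - h2(p1)), p1, pb

se_star, p1_star, pb = minimum_search_exponent(rate=0.5, q=0.1)
```

For these parameters the resulting exponent lies above $R/2$, in line with the remark on the preceding slide.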
Numerical Evaluation
Fingerprint extraction
◦ We use an image database of 20,828 grayscale images from ImageNet.
◦ All images are resized to 384 × 512 pixels.
◦ A binary fingerprint of length 64 is extracted from each image, so R ≈ 0.22.
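Identification performance, memory and complexity analysis
Columns: $P_E$ (%); clustering with k-medians ($M_1$ = 180: $M_2$ ≈, $M_3$) and BBMM ($M_1$ = 220: $M_2$ ≈, $M_3$); search complexity for k-medians ($S_e$, usage %) and BBMM ($S_e$, usage %); $M_e$ for k-medians and BBMM.
Distortion model, parameter | P_E (%) | k-medians M2, M3 | BBMM M2, M3 | k-medians Se, usage (%) | BBMM Se, usage (%) | Me k-medians | Me BBMM
AWGN, PSNR = 40 dB          | 0.019   | 115, 30          | 473, 5      | 0.19, 16.73             | 0.17, 10.68        | 0.22         | 0.26
AWGN, PSNR = 30 dB          | 0.024   | 115, 70          | 568, 6      | 0.21, 38.85             | 0.18, 14.95        | 0.22         | 0.26
AWGN, PSNR = 20 dB          | 0.389   | 115, 125         | 946, 10     | 0.22, 69.33             | 0.2, 35.14         | 0.22         | 0.27
JPEG, QF = 75               | 0.010   | 115, 27          | 378, 4      | 0.18, 15.06             | 0.16, 8.67         | 0.22         | 0.25
JPEG, QF = 50               | 0.014   | 115, 30          | 568, 5      | 0.19, 16.73             | 0.17, 12.64        | 0.22         | 0.26
JPEG, QF = 25               | 0.016   | 115, 50          | 662, 7      | 0.20, 27.79             | 0.19, 19.62        | 0.22         | 0.27
Histeq                      | 3.140   | 115, 140         | 1236, 13    | 0.22, 87.32             | 0.21, 50.91        | 0.22         | 0.28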
Active Content Fingerprinting
Conventional Passive Content Fingerprinting
Main Challenge
Complexity
Robust fingerprint
Binary fingerprint: Bounded Distance Decoder (BDD)
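[Diagram: a BSC between $\bar{X}^L$ and $\bar{Y}^L$ with cross-over probability $P_b$ (0 → 0 and 1 → 1 with probability $1 - P_b$) ⇒ a Hamming sphere of radius $\theta L$ around the query $\bar{y}^L$, with $\theta \propto P_b$.]
Relatively large $P_b$ ⇒ large Hamming sphere ⇒ high complexity;
however,
robust fingerprint (small $P_b$) ⇒ small Hamming sphere ⇒ low complexity.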
One solution
Active Content Fingerprint: modify content to make its fingerprint (FP) more
robust [Voloshynovskiy et al.(2012)].
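The link between $P_b$ and complexity in the bounded distance decoder can be made concrete by counting the binary words inside a Hamming sphere. The sketch below uses that count as a rough proxy for the number of candidates to examine; treating the sphere volume as the complexity measure is an assumption made here for illustration.

```python
import math

def hamming_sphere_size(length, radius):
    """Number of binary words within Hamming distance `radius` of a given word."""
    return sum(math.comb(length, i) for i in range(radius + 1))

# Sphere volumes for L = 32 and a few radii theta * L.
sizes = [hamming_sphere_size(32, r) for r in (0, 1, 2, 4, 8)]
```

The volume grows very quickly with the radius, which is why reducing $P_b$ (and hence $\theta$) by modifying the content pays off in decoding effort.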
Unidimensional Case of aCFP
Shrinkage-Based aCFP (SbaCFP) [Farhadzadeh et al.(2013a)]
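To make the FP more robust:
[Figure, three panels. Original distribution $p(\tilde{x})$: coefficients with $-\gamma < \tilde{x} < +\gamma$ are less robust to bit flips. Modulator function $\varphi_s(\tilde{x})$: coefficients inside $(-\gamma, +\gamma)$ are moved to $\pm\gamma$. Modulated distribution $p(\varphi_s(\tilde{x}))$: probability masses of $\frac{1}{2} - Q(\frac{\gamma}{\sigma_X})$ at $-\gamma$ and at $+\gamma$.]
$P_b = \Pr\{\mathrm{sign}(\tilde{X}_i) \ne \mathrm{sign}(\tilde{Y}_i)\} = E\left[Q\left(\frac{|\varphi_s(\tilde{X})|}{\sigma_Z}\right)\right]$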
Shrinkage-Based aCFP [Farhadzadeh et al.(2013a)]
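Implementation:
[Block diagram: $X^N$ is transformed by $W_F$ into $\tilde{X}^N$; the divider splits it into $\tilde{X}^L$, which passes through the modulator $\varphi_s(\cdot)$, and the complementary part $\tilde{X}_c^{N-L}$; the combiner merges both parts and $W_F^{-1}$ yields $V^N$ (modified content). The binary fingerprint $\bar{X}^L$ is obtained with sign$(\cdot)$; the key is $k$.]
$W_F \in \{\pm 1/\sqrt{N}\}^{N \times N}$, $W \subset W_F$
$D_s = \frac{1}{N} E\left[\|X^N - V^N\|_2^2\right] = 2 \frac{L}{N} \int_0^{\gamma} (\gamma - t)^2 p_{\tilde{X}}(t)\, dt$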
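The bit-error and distortion expressions can be evaluated numerically for a Gaussian projected coefficient. In the sketch below the modulator is taken to be the map that pushes every coefficient with magnitude below $\gamma$ out to $\pm\gamma$ and leaves the rest unchanged, which is the form implied by the distortion integral; the midpoint-rule integration, the truncation of the Gaussian tail at ten standard deviations, and the function names are assumptions.

```python
import math

def q_function(t):
    """Gaussian tail probability Q(t)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def modulate(x, gamma):
    """Move coefficients with |x| < gamma to +-gamma, keep the others."""
    if abs(x) >= gamma:
        return x
    return gamma if x >= 0 else -gamma

def gaussian_pdf(t, sigma):
    return math.exp(-t * t / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, steps=4000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def bit_error(gamma, sigma_x, sigma_z):
    """P_b = E[Q(|phi_s(X)| / sigma_z)] for X ~ N(0, sigma_x^2)."""
    inside = (1 - 2 * q_function(gamma / sigma_x)) * q_function(gamma / sigma_z)
    outside = 2 * integrate(
        lambda t: q_function(t / sigma_z) * gaussian_pdf(t, sigma_x),
        gamma, gamma + 10 * sigma_x)
    return inside + outside

def distortion(gamma, sigma_x, l_over_n):
    """D_s = 2 (L/N) * integral_0^gamma (gamma - t)^2 p(t) dt."""
    return 2 * l_over_n * integrate(
        lambda t: (gamma - t) ** 2 * gaussian_pdf(t, sigma_x), 0.0, gamma)

passive = bit_error(0.0, 1.0, 1.0)   # gamma = 0: no modulation
active = bit_error(1.0, 1.0, 1.0)    # gamma = sigma_x
cost = distortion(1.0, 1.0, 0.01)
```

Setting $\gamma = 0$ recovers the passive case, so the first value should agree with $\frac{1}{\pi}\arctan(\sigma_Z/\sigma_X)$, while a positive $\gamma$ trades a lower bit-error probability for a nonzero distortion.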
Analytical comparison
[Figure: $P_b$ (logarithmic scale, $10^{-30}$ to $10^{0}$) versus DNR (0 to 25) for pCFP, LB and SbaCFP.]
Comparison of pCFP, SbaCFP and LB in DWM, DWR=24dB and L/N = 0.01.
Advantages:
◦ more robust
◦ lower complexity
Disadvantages:
◦ quality degradation
◦ still random structure: BDD
Multidimensional Case of aCFP
Lattice-Based aCFP (LbaCFP) [Farhadzadeh et al.(2013a)]
From random to structural fingerprint
$X^N \in \mathbb{R}^N$ → $W$ → $\tilde{X}^L \in \mathbb{R}^L$
Remark
A lattice quantizer is used to modify contents to impose structure on $\tilde{X}^L$.
Use the Leech lattice in $\mathbb{R}^{24}$ [Leech(1967)]
◦ Largest kissing number
◦ Largest packing density
◦ Very fast decoder: about 519 operations [Amrani and Beery(1996)]
Lattice-Based aCFP [Farhadzadeh et al.(2013a)]
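Implementation:
[Block diagram: $X^N$ is transformed by $W_F$ into $\tilde{X}^N$; the divider splits it into $\tilde{X}^L$, which passes through the lattice modulator $\varphi_\Lambda(\cdot)$, and the complementary part $\tilde{X}_c^{N-L}$; the combiner merges both parts and $W_F^{-1}$ yields $V^N$ (modified content). The modulator also outputs the fingerprint $\bar{x}^L$; the key is $k$.]
$W_F \in \{\pm 1/\sqrt{N}\}^{N \times N}$, $W \subset W_F$
$D_\Lambda = \frac{1}{N} E\left[\|X^N - V^N\|_2^2\right] = \frac{L}{N}\, G(\Lambda, V)\, V^{2/L}$
where $G(\Lambda, V)$ is the normalized second moment and $V$ is the volume of the Voronoi region V.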
Advantages:
◦ more robust
◦ very low complexity
Disadvantages:
◦ quality degradation
◦ non-uniform distribution
Numerical Evaluation
Fingerprint extraction
We employ a real image database of 1,338 grayscale images from UCID [Schaefer and Stich(2004)].
A binary fingerprint is extracted from each modulated image.
Different modulation schemes
Modulator | Parameters         | PSDR  | MSSIM
SbaCFP    | L = 192, γ = 60    | 53 dB | 0.999
SbaCFP    | L = 32, γ = 110    | 53 dB | 0.999
LbaCFP    | scale = 70, L = 24 | 53 dB | 0.999
Performance Analysis
Unidimensional case
Modulator: SbaCFP, L = 32, θ = 2/L, γ = 110
Distortion | Parameter (im. domain) | Parameter (proj. domain) | Pci  | Pfa | P̂b   | Pb   | Pb-pCFP | Pci-pCFP
AWGN       | PSNR = 20 dB           | DNR = 18 dB              | 1     | 0   | 0     | 0    | 0.05    | 0.80
AWGN       | 15 dB                  | 13 dB                    | 0.998 | 0   | 0.004 | 0.003 | 0.08   | 0.55
AWGN       | 10 dB                  | 8 dB                     | 0.76  | 0   | 0.05  | 0.05 | 0.13    | 0.21
AWGN       | 5 dB                   | 3 dB                     | 0.15  | 0   | 0.15  | 0.14 | 0.21    | 0.05
JPEG       | QF = 25                | 27 dB                    | 1     | 0   | 0     | 0    | 0.01    | 0.98
JPEG       | 10                     | 20 dB                    | 1     | 0   | 0     | 0    | 0.04    | 0.86
JPEG       | 1                      | 10 dB                    | 0.92  | 0   | 0.03  | 0.02 | 0.11    | 0.33
Histeq     |                        | 4 dB                     | 0.94  | 0   | 0.02  | 0.12 | 0.14    | 0.43
Remark
The $P_b$ values of active content fingerprinting improve markedly over those of pCFP, which leads to a high $P_{ci}$.
Multidimensional Case
Distortion | Parameter (image domain) | Parameter (projection domain) | Pci, SbaCFP (L = 192, θ = 2/L) | Pci, LbaCFP (L = 24)
AWGN       | PSNR = 20 dB             | DNR = 18 dB                   | 0.96                            | 1
AWGN       | 15 dB                    | 13 dB                         | 0.07                            | 0.83
AWGN       | 10 dB                    | 8 dB                          | 0                               | 0.01
AWGN       | 5 dB                     | 3 dB                          | 0                               | 0
JPEG       | QF = 25                  | 27 dB                         | 1                               | 1
JPEG       | 10                       | 19 dB                         | 0.998                           | 1
JPEG       | 1                        | 10 dB                         | 0                               | 0.14
Histeq     |                          | 6 dB (LbaCFP 3 dB)            | 0.3                             | 0.11
Remark
Except for Histeq, LbaCFP outperforms the unidimensional modulation schemes.
Conclusions and Future Work
Conclusions
We introduced an identification setup based on the constrained list-based
decoder and analyzed its performance.
We analyzed a simple digital fingerprinting approach based on random
projections. Random projections not only can reduce the data
dimensionality but can also eliminate correlation among data samples.
We investigated a two–stage decoding scheme capable of achieving the
identification capacity with a search complexity lower than that of exhaustive
search.
We presented active content fingerprinting taking the best of content
fingerprinting and digital watermarking to overcome some of the
fundamental restrictions of these techniques.
Future Work
Identification based on a unique representative −→ multiple representatives
First order autoregressive process −→ higher order autoregressive process
Two–stage decoding in identification −→ information retrieval
Identification rate, search and memory complexity trade-offs
−→ including security and privacy leakage trade–offs
References
Amrani, O., Beery, Y., 1996, Efficient bounded-distance decoding of the hexacode and associated decoders for the Leech lattice
and the Golay code, IEEE Trans. on Com., 44, 534–537
Farhadzadeh, F., Voloshynovskiy, S., Koval, O., 2010, Performance analysis of identification system based on order statistics list
decoder, in Proc. of IEEE International Symposium on Information Theory (ISIT), Austin, TX
Farhadzadeh, F., Voloshynovskiy, S., Koval, O., Beekhof, F., 2011, Information-theoretic analysis of content based identification
for correlated data, in IEEE Information Theory Workshop (ITW), pp. 205–209, Paraty, Brazil
Farhadzadeh, F., Voloshynovskiy, S., Holotyak, T., Beekhof, F., 2013a, Active content fingerprinting: Shrinkage and lattice based
modulations, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada
Farhadzadeh, F., Willems, F. M., Voloshynovskiy, S., 2013b, Fundamental limits of identification: Identification rate, search and
memory complexity trade–off, in IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey
Farhadzadeh, F., Sun, K., Ferdowsi, S., 2014, Efficient two stage decoding scheme to achieve content identification capacity,
submitted
Jégou, H., Douze, M., Schmid, C., 2011, Product quantization for nearest neighbor search, Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 33, 117–128
Leech, J., 1967, Notes on sphere packings, Canadian Journal of Mathematics
Schaefer, G., Stich, M., 2004, Ucid - an uncompressed colour image database, in Storage and Retrieval Methods and Applications
for Multimedia, Proc.of SPIE
Voloshynovskiy, S., Farhadzadeh, F., Koval, O., Holotyak, T., 2012, Active content fingerprinting: a marriage of digital
watermarking and content fingerprinting, in IEEE WIFS, Tenerife, Spain
Willems, F., 2009, Searching methods for biometric identification systems: Fundamental limits, in IEEE International Symposium
on Information Theory (ISIT), pp. 2241 –2245
Willems, F., Kalker, T., Goseling, J., Linnartz, J., 2003, On the capacity of a biometrical identification system, in Proc. of IEEE
International Symposium on Information Theory, p. 82

More Related Content

Similar to Information-Theoretic Analysis of ID Systems in Large Databases

Protein docking by LZerD, KiharaLab at CAPRI meeting 2016
Protein docking by LZerD, KiharaLab at CAPRI meeting 2016Protein docking by LZerD, KiharaLab at CAPRI meeting 2016
Protein docking by LZerD, KiharaLab at CAPRI meeting 2016Purdue University
 
Scalable and Privacy-preserving Data Integration - Part 2
Scalable and Privacy-preserving Data Integration - Part 2Scalable and Privacy-preserving Data Integration - Part 2
Scalable and Privacy-preserving Data Integration - Part 2ErhardRahm
 
Ivanov paraling trento_2013_part1
Ivanov paraling trento_2013_part1Ivanov paraling trento_2013_part1
Ivanov paraling trento_2013_part1Alexei Ivanov
 
Igarss1792_v2.ppt
Igarss1792_v2.pptIgarss1792_v2.ppt
Igarss1792_v2.pptgrssieee
 
Adaptive blind multiuser detection under impulsive noise using principal comp...
Adaptive blind multiuser detection under impulsive noise using principal comp...Adaptive blind multiuser detection under impulsive noise using principal comp...
Adaptive blind multiuser detection under impulsive noise using principal comp...csandit
 
ADAPTIVE BLIND MULTIUSER DETECTION UNDER IMPULSIVE NOISE USING PRINCIPAL COMP...
ADAPTIVE BLIND MULTIUSER DETECTION UNDER IMPULSIVE NOISE USING PRINCIPAL COMP...ADAPTIVE BLIND MULTIUSER DETECTION UNDER IMPULSIVE NOISE USING PRINCIPAL COMP...
ADAPTIVE BLIND MULTIUSER DETECTION UNDER IMPULSIVE NOISE USING PRINCIPAL COMP...csandit
 
ADAPTIVE BLIND MULTIUSER DETECTION UNDER IMPULSIVE NOISE USING PRINCIPAL COMP...
ADAPTIVE BLIND MULTIUSER DETECTION UNDER IMPULSIVE NOISE USING PRINCIPAL COMP...ADAPTIVE BLIND MULTIUSER DETECTION UNDER IMPULSIVE NOISE USING PRINCIPAL COMP...
ADAPTIVE BLIND MULTIUSER DETECTION UNDER IMPULSIVE NOISE USING PRINCIPAL COMP...cscpconf
 
SLOPE 1st workshop - presentation 2
SLOPE 1st workshop - presentation 2SLOPE 1st workshop - presentation 2
SLOPE 1st workshop - presentation 2SLOPE Project
 
Improving circuit miniaturization and its efficiency using Rough Set Theory( ...
Improving circuit miniaturization and its efficiency using Rough Set Theory( ...Improving circuit miniaturization and its efficiency using Rough Set Theory( ...
Improving circuit miniaturization and its efficiency using Rough Set Theory( ...Sarvesh Singh
 
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
Maxim Kazantsev
 
Source coding for a mixed source: determination of second order asymptotics
Source coding for a mixed source: determination of second order asymptoticsSource coding for a mixed source: determination of second order asymptotics
Source coding for a mixed source: determination of second order asymptoticsFelix Leditzky
 
Privacy-preserving Information Sharing: Tools and Applications
Privacy-preserving Information Sharing: Tools and ApplicationsPrivacy-preserving Information Sharing: Tools and Applications
Privacy-preserving Information Sharing: Tools and ApplicationsEmiliano De Cristofaro
 
Cheat sheets for AI
Cheat sheets for AICheat sheets for AI
Cheat sheets for AINcib Lotfi
 

Similar to Information-Theoretic Analysis of ID Systems in Large Databases (20)

Open-process Algorithm Design
Open-process Algorithm DesignOpen-process Algorithm Design
Open-process Algorithm Design
 
Protein docking by LZerD, KiharaLab at CAPRI meeting 2016
Protein docking by LZerD, KiharaLab at CAPRI meeting 2016Protein docking by LZerD, KiharaLab at CAPRI meeting 2016

Information-Theoretic Analysis of ID Systems in Large Databases

  • 1. Information-Theoretic Analysis of Identification Systems in Large-Scale Databases Farzad Farhadzadeh Computer Science Department, University of Geneva, Switzerland January 15, 2014
  • 2. Outline: Introduction; Fingerprint Statistics and Constrained List-Based Decoder; Identification Rate, Search and Memory Complexity Trade-off; Active Content Fingerprinting; Conclusions and Future Work.
  • 3. Introduction: Motivation. Identification targets: physical objects (jewelry, packaging), humans (biometrics), digital contents. Main concerns: high-dimensional data; highly correlated data; performance; search complexity; memory complexity.
  • 4. Introduction: Identification Setup. Block diagram: the source emits an index $W$; the selector picks the entry $X^N(W)$ from the database $X^N(1), \dots, X^N(M)$; the observation channel turns it into $Y^N$; the decoder, with access to the database, outputs the estimate $\hat{W}$. An identification rate $R$ is called achievable if, for any $\delta > 0$ and large enough $N$, there exist decoders such that $\frac{1}{N}\log_2 M \ge R - \delta$ and $P_E \le \delta$. Error probability: $P_E \triangleq \frac{1}{M}\sum_{w=1}^{M} \Pr\{\hat{W} \ne w \mid W = w\}$.
  • 5. Introduction: Identification Setup. Block diagram: source $W$, selector $X^N(W)$, observation channel $Y^N$, database $X^N(1), \dots, X^N(M)$, decoder $\hat{W}$. Theorem: the capacity of an identification system $C_{id}$, the supremum of all achievable rates, is given by [Willems et al.(2003)] $C_{id} = I(X; Y)$, where $P(x, y) = Q_s(x) Q_c(y \mid x)$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$.
  • 6. Introduction: Main Contributions. Content identification: passive and active. Digital fingerprint: to address search-memory complexity issues. List decoder: to improve the identification performance. $R_{id}$, $S_e$, $M_e$ trade-off: identification rate, search and memory complexity trade-off. aCFP: marriage of passive fingerprinting and watermarking.
  • 7. Fingerprint Statistics and Constrained List-Based Decoder: Overview (outline as on slide 2).
  • 8. Fingerprint Statistics and Constrained List-Based Decoder: Identification Setup. Content identification based on binary fingerprints. Enrollment: each item $X^N(1), X^N(2), \dots, X^N(M)$ passes through the fingerprint extraction $\psi(\cdot)$, and the binary fingerprints $\bar{X}^L(1), \bar{X}^L(2), \dots, \bar{X}^L(M)$ are stored in the database. Identification: the item $X^N(W)$ passes through the acquisition channel $P(Y^N \mid X^N)$, the fingerprint $\bar{Y}^L$ is extracted from the query $Y^N$ by the same $\psi(\cdot)$, and the decoder returns a list of candidates $\mathcal{N}_l$. Definition: a digital fingerprint is a robust, short and discriminative content representation.
  • 9. Fingerprint Statistics and Constrained List-Based Decoder: Statistical Analysis of Digital Fingerprint: Digital Fingerprint from Correlated Data. Source model: $X^N \sim \mathcal{N}(0_N, K_{xx})$ with $X_n = \rho X_{n-1} + \Xi_n$, $\Xi_n \sim \mathcal{N}(0, (1 - \rho^2)\sigma_X^2)$. Fingerprint extraction $\psi(\cdot)$: projection $\tilde{X}^L = W^{\dagger} X^N \sim \mathcal{N}(0_L, K_{\tilde{x}\tilde{x}})$ with a matrix $W \in \{\pm 1/\sqrt{N}\}^{N \times L}$ (key $k$), $W_{ij} \sim \text{Bernoulli}(0.5)$, followed by binarization $\bar{X}^L = \mathrm{sign}(\tilde{X}^L) \in \{0, 1\}^L$.
  • 10. Information-Theoretic Analysis of Identification Systems in Large-Scale Databases Fingerprint Statistics and Constrained List-Based Decoder Statistical Analysis of Digital Fingerprint Digital Fingerprint from Correlated Data [Farhadzadeh et al.(2011)] Proposition Off-diagonal and diagonal elements of K˜x˜x can be bounded as follows Pr max i=j |Kij ˜x˜x | > βσ2 X < 1 L (Off-diagonal elements) Pr max i |Kii ˜x˜x − σ2 X | > ασ2 X < 2 L 1 ρ (Diagonal elements) where β = 1−ρN 1−ρ 12 N ln L, and α = 1−ρN−1 1−ρ 8 N ρ ln L. Remark For a sufficiently large N and L, L ≤ N: β → 0 and α → 0, K˜x˜x converges to σ2 X IL with high probability. Gaussian: uncorrelated ⇒ independent ⇒ ¯XL ∼ i.i.d. Bernoulli 1 2 . Conclusion F. Farhadzadeh 9 / 40
  • 11. Fingerprint Statistics and Constrained List-Based Decoder: Statistical Analysis of Digital Fingerprint: Statistics of Query Fingerprint [Farhadzadeh et al.(2011)]. Query model: $Y^N = X^N + Z^N$ with $X^N \in \mathbb{R}^N$ and $Z^N \sim \mathcal{N}(0_N, \sigma_Z^2 I_N)$. Query fingerprint (same $W$, key $k$): $\bar{Y}^L = \mathrm{sign}(\tilde{Y}^L) = \mathrm{sign}(W^{\dagger} Y^N) = \mathrm{sign}(W^{\dagger} X^N + W^{\dagger} Z^N)$, with $\tilde{Y}^L \in \mathbb{R}^L$. Conclusion: Gaussian: uncorrelated ⇒ independent ⇒ $\bar{Y}^L \sim$ i.i.d. Bernoulli$(\frac{1}{2})$. The additive white Gaussian noise (AWGN) channel between $X^N$ and $Y^N$ becomes a binary symmetric channel (BSC) between $\bar{X}^L$ and $\bar{Y}^L$ with cross-over probability $P_b = \frac{1}{\pi}\arctan\frac{\sigma_Z}{\sigma_X}$.
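The fingerprint pipeline of slides 9 and 11 can be checked numerically. The following is a minimal sketch, not the author's implementation: it draws a first-order autoregressive signal, projects it with a fixed ±1/√N matrix, takes signs, and compares the empirical bit-flip rate under AWGN with $P_b = \frac{1}{\pi}\arctan(\sigma_Z/\sigma_X)$. All function names and parameter values (`n = 128`, `l = 16`, `rho = 0.4`) are illustrative assumptions.

```python
import math
import random


def ar1_sequence(n, rho, sigma_x, rng):
    """Stationary AR(1) sequence: X_n = rho * X_{n-1} + Xi_n."""
    innovation_std = math.sqrt(1.0 - rho ** 2) * sigma_x
    x = [rng.gauss(0.0, sigma_x)]
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0.0, innovation_std))
    return x


def projection_rows(n, l, rng):
    """l rows of W^dagger; every entry is +1/sqrt(n) or -1/sqrt(n) with probability 1/2."""
    scale = 1.0 / math.sqrt(n)
    return [[scale if rng.random() < 0.5 else -scale for _ in range(n)]
            for _ in range(l)]


def fingerprint(x, w_rows):
    """Binary fingerprint: bit is 1 when the projection is non-negative, else 0."""
    return [1 if sum(w * v for w, v in zip(row, x)) >= 0 else 0 for row in w_rows]


def empirical_pb(n, l, rho, sigma_x, sigma_z, trials, seed):
    """Fraction of fingerprint bits flipped by additive white Gaussian noise."""
    rng = random.Random(seed)
    w_rows = projection_rows(n, l, rng)
    flips = 0
    for _ in range(trials):
        x = ar1_sequence(n, rho, sigma_x, rng)
        y = [v + rng.gauss(0.0, sigma_z) for v in x]
        flips += sum(a != b for a, b in zip(fingerprint(x, w_rows),
                                            fingerprint(y, w_rows)))
    return flips / (trials * l)


def theoretical_pb(sigma_x, sigma_z):
    """Cross-over probability of the equivalent BSC."""
    return math.atan(sigma_z / sigma_x) / math.pi
```

For example, `empirical_pb(128, 16, 0.4, 1.0, 0.5, 400, seed=1)` should land close to `theoretical_pb(1.0, 0.5)`; the match is only approximate because the projected variances deviate slightly from $\sigma_X^2$ for correlated data.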
  • 12. Fingerprint Statistics and Constrained List-Based Decoder: Performance Analysis: Constrained List-Based Decoder [Farhadzadeh et al.(2010)]. Binary case: the database entries are ranked by Hamming distance to the query, $d_H(\bar{y}^L, \bar{x}^L(r_1)) \le \dots \le d_H(\bar{y}^L, \bar{x}^L(r_j)) \le \dots \le d_H(\bar{y}^L, \bar{x}^L(r_{N_l})) \le \dots \le d_H(\bar{y}^L, \bar{x}^L(r_M))$, and the list $\mathcal{N}_l$ is formed from the $N_l$ closest entries whose distance is at most $\eta L$. Probability of miss: $P_m = 1 - P_{ci} = \frac{1}{M}\sum_{w=1}^{M} \Pr\{(w \notin \mathcal{N}_l) \cup (D_w > \eta L) \mid H_w\}$, where $P_{ci}$ is the probability of correct identification. Probability of false acceptance: $P_{fa} = \Pr\{\mathcal{N}_l \ne \emptyset \mid H_0\}$.
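One way to read the decoder on slide 12 is sketched below, under the assumption that the list consists of the `list_size` nearest entries that also satisfy the distance constraint $\eta L$. The function names and the toy database are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(x != y for x, y in zip(a, b))


def constrained_list_decode(query, database, list_size, eta):
    """Indices of the list_size nearest entries whose distance is at most eta * L.

    An empty list means the query is rejected as unknown.
    """
    threshold = eta * len(query)
    ranked = sorted(range(len(database)),
                    key=lambda w: hamming(query, database[w]))
    return [w for w in ranked[:list_size]
            if hamming(query, database[w]) <= threshold]


database = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
query = [1, 1, 1, 0, 0, 0, 0, 0]  # one bit away from entry 1
candidates = constrained_list_decode(query, database, list_size=2, eta=0.25)
```

Here entry 0 is among the two nearest entries but lies at distance 3, above the threshold $\eta L = 2$, so only entry 1 is listed.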
  • 13. Fingerprint Statistics and Constrained List-Based Decoder: Performance Analysis: Probability of False Acceptance [Farhadzadeh et al.(2011)]. $P_{fa} = \Pr\left\{\bigcup_{w=1}^{M} (D_w \le \eta L) \mid H_0\right\}$. Proposition: for a binary database with $M = e^{LR}$ entries of length $L$, the probability of false acceptance of the constrained list-based decoder, $P_{fa}$, for any $P_b < \eta < \frac{1}{2}$, satisfies $P_{fa} \le \exp[-L(\ln 2 - H_2(\eta) - R)]$, where $H_2(\eta) = -\eta \log \eta - (1 - \eta)\log(1 - \eta)$. Remarks: $P_{fa}$ is the same as for the unique decoder. If $R < \ln 2 - H_2(\eta)$, then $L \to \infty$ implies $P_{fa} \to 0$.
  • 14. Fingerprint Statistics and Constrained List-Based Decoder: Performance Analysis: Probability of Miss [Farhadzadeh et al.(2011)]. $P_m = \Pr\{(1 \notin \mathcal{N}_l) \cup (D_1 > \eta L) \mid H_1\} = \underbrace{\Pr\{(1 \notin \mathcal{N}_l) \cap (D_1 \le \eta L) \mid H_1\}}_{P_m^{I}} + \underbrace{\Pr\{D_1 > \eta L \mid H_1\}}_{P_m^{II}}$. Proposition: the probability of miss of the constrained list-based decoder, $P_m$, for any $P_b < \eta < \frac{1}{2}$, satisfies $P_m \le \left(\exp[-L(\ln 2 - H_2(\eta) - R)]\right)^{N_l} + \exp[-L\, D(\eta \| P_b)]$, where $D(\eta \| P_b) = \eta \log\frac{\eta}{P_b} + (1 - \eta)\log\frac{1 - \eta}{1 - P_b}$. Remarks: for $N_l > 1$ the first kind of miss probability, $P_m^{I}$, decays faster. If $R < \ln 2 - H_2(\eta)$, then $L \to \infty$ implies $P_m \to 0$.
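The two bounds of slides 13 and 14 are easy to evaluate. The sketch below assumes natural logarithms throughout (consistent with $M = e^{LR}$ and the $\ln 2$ term) and reads the list size $N_l$ as an exponent on the first term of the miss bound; the function names and sample parameters are illustrative.

```python
import math


def binary_entropy_nats(p):
    """H_2(p) in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def kl_bernoulli_nats(eta, pb):
    """D(eta || pb) between two Bernoulli distributions, in nats (0 < eta, pb < 1)."""
    return (eta * math.log(eta / pb)
            + (1.0 - eta) * math.log((1.0 - eta) / (1.0 - pb)))


def pfa_bound(length, eta, rate):
    """Upper bound on the probability of false acceptance."""
    return math.exp(-length * (math.log(2.0) - binary_entropy_nats(eta) - rate))


def pm_bound(length, eta, rate, pb, list_size):
    """Upper bound on the probability of miss for a list of size list_size."""
    first_kind = pfa_bound(length, eta, rate) ** list_size
    second_kind = math.exp(-length * kl_bernoulli_nats(eta, pb))
    return first_kind + second_kind
```

With, say, `length=64`, `eta=0.2`, `rate=0.05` and `pb=0.05`, increasing `list_size` from 1 to 2 shrinks only the first term, which matches the remark that $P_m^{I}$ decays faster for $N_l > 1$.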
  • 15. Fingerprint Statistics and Constrained List-Based Decoder: Numerical Evaluation: Feature Extraction. Each image is split into 16 × 16 blocks, a DCT is applied to every block, and the DCT (1, 2) coefficients of the blocks are collected into $X^N$. Data statistics: feature domain ($N = 768$): $\rho = 0.41$; RP domain ($L = 32$): $\max_{i \ne j} K^{ij}_{\tilde{x}\tilde{x}} = 0.08$, $\theta_{\tilde{X}} = 1.74$; binary domain ($L = 32$): $\max_{i \ne j} K^{ij}_{\bar{x}\bar{x}} = 0.07$, $\hat{P} = 0.5$. Remark: projected data are approximately uncorrelated and follow a Gaussian distribution.
  • 16. Fingerprint Statistics and Constrained List-Based Decoder: Numerical Evaluation: Data Statistics. Values per row, grouped by domain: feature domain ($\max_{i \ne j} K^{ij}_{zz}$, $\theta_Z$, $\max K_{zx}$); RP domain ($\max_{i \ne j} K^{ij}_{\tilde{z}\tilde{z}}$, $\theta_{\tilde{Z}}$, $\max K_{\tilde{z}\tilde{x}}$); binary domain ($\max_{i \ne j} K^{ij}_{\bar{z}\bar{z}}$, $P_b$, $\hat{P}_b$, $\max K_{\bar{z}\bar{x}}$).
      AWGN, PSNR 5 dB: 0.03, 2, 0.03; 0.08, 2, 0.09; 0.08, 0.20, 0.21, 0.09
      AWGN, PSNR 10 dB: 0.03, 2, 0.03; 0.07, 2, 0.07; 0.07, 0.12, 0.13, 0.08
      AWGN, PSNR 15 dB: 0.03, 2, 0.03; 0.08, 2, 0.08; 0.08, 0.07, 0.08, 0.09
      AWGN, PSNR 20 dB: 0.03, 2, 0.03; 0.08, 2, 0.09; 0.08, 0.04, 0.05, 0.09
      JPEG, QF 1: 0.04, 1.2, 0.09; 0.10, 1.93, 0.10; 0.10, 0.10, 0.11, 0.09
      JPEG, QF 10: 0.04, 1.8, 0.05; 0.07, 1.96, 0.08; 0.08, 0.03, 0.04, 0.09
      JPEG, QF 25: 0.03, 1.95, 0.06; 0.06, 1.99, 0.09; 0.07, 0.01, 0.01, 0.09
      Histeq: 0.15, 0.75, 0.49; 0.20, 1.19, 0.30; 0.12, 0.14, 0.1, 0.09
      Remark: noise approximately follows a Gaussian distribution and is independent of content.
  • 17. Fingerprint Statistics and Constrained List-Based Decoder: Numerical Evaluation: Identification Performance. [Figure: curves over the axes $P_{fa}$ and $P_{ci}$ for PSNR = 5 with $N_l = 1$, $N_l = 2$ and $N_l = 4$, and for PSNR = 20 with $N_l = 1$.] Remark: the list-based decoder improves the performance in a certain range of list sizes.
  • 18. Identification Rate, Search and Memory Complexity Trade-off: Overview (outline as on slide 2).
  • 19. Identification Rate, Search and Memory Complexity Trade-off: Introduction: Search Complexity Reduction. Block diagram: source $W$, selector $X^N(W)$, observation channel $Y^N$, database $X^N(1), \dots, X^N(M)$, decoder $\hat{W}$. Search complexity: the decoder has to check exhaustively all $x^N(w)$, $1 \le w \le M$, to find the best match. Question: can we speed up this process? Idea: do clustering.
  • 20-24. Identification Rate, Search and Memory Complexity Trade-off: Introduction: Search Complexity Reduction (one slide built up in five steps). To do clustering, assign the database entries to clusters: disjoint clusters or overlapped clusters. To find the best match: single cluster estimation or multiple cluster estimation. Diagram: the query $y^N$ is compared with the cluster representatives $u^N(1), u^N(2), \dots, u^N(w_1), \dots, u^N(M_1)$, each covering a group of the database entries $x^N(1), \dots, x^N(M)$; single estimation selects one cluster (e.g. $u^N(1)$), multiple estimation selects several (e.g. $u^N(1)$ and $u^N(2)$). Question: what is the fundamental trade-off between the number of cluster checks and refinement checks?
  • 25. Identification Rate, Search and Memory Complexity Trade-off: Generalized Model Description and Statement of Result: Generalized Two-Stage Decoding [Farhadzadeh et al.(2013b)]. Block diagram: source $W$, selector $X^N(W)$, observation channel $Y^N$, first decoder, second decoder (with access to $X^N(1), \dots, X^N(M)$), combiner. The first decoder determines, from $Y^N$, $W_1 = (W_1(1), \dots, W_1(M_3))$ and sends them to the second decoder. The second decoder determines, from $Y^N$ and $W_1$, the indices $W_1$ and $W_2$ and sends them to the combiner. The combiner determines the index $\hat{W}$.
  • 26. Identification Rate, Search and Memory Complexity Trade-off: Generalized Model Description and Statement of Result: Fundamental Trade-off [Farhadzadeh et al.(2013b)]. Theorem: the region of achievable rate quadruples of the identification system is given by $\{(R_1, R_2, R_3, R) : R_1 \ge I(X, Y; U),\ R_2 \ge \max(0, R - I(X; U)),\ R_3 \ge I(X; U \mid Y),\ 0 \le R \le C_{id} = I(X; Y)$, for $P(x, y, u) = Q_s(x) Q_c(y \mid x) P(u \mid x, y)$, where $|\mathcal{U}| \le |\mathcal{Y}| \cdot |\mathcal{X}| + 2\}$. Remark: since $P(x, y, u) = Q_s(x) Q_c(y \mid x) P(u \mid x, y)$, the auxiliary random variable $U$ depends on both $X$ and $Y$.
  • 27. Identification Rate, Search and Memory Complexity Trade-off: Generalized Model Description and Statement of Result: Achievable Search-Clustering Schemes. Single decoding: disjoint clustering ×; overlapped clustering $X ↔ Y ↔ U$. Multiple decoding: disjoint clustering $U ↔ X ↔ Y$; overlapped clustering General. $X ↔ Y ↔ U$: centroid statistics depend on query statistics [Willems(2009)]. $U ↔ X ↔ Y$: centroid statistics depend on database entries [Jégou et al.(2011)]. General: centroid statistics depend on both. Question: what is the optimal scheme? Idea: search-memory complexity analysis.
  • 28. Identification Rate, Search and Memory Complexity Trade-off: Search-Memory Complexity: Memory Complexity Exponent [Farhadzadeh et al.(2013b)]. Diagram: each cluster $u^N(w_1)$, $w_1 = 1, \dots, M_1$, holds the members $x^N(w_1, 1), x^N(w_1, 2), \dots, x^N(w_1, M_2)$. Clusters: $2^{N I(X,Y;U)}$ of $u^N(w_1)$. Cluster members: if $R > I(U; X)$, $2^{N[R - I(U;X)]}$ of $x^N(w)$ in each cluster; if $R < I(U; X)$, at most a single $x^N(w)$ in each cluster. $M_e = \max\{\underbrace{I(U; X, Y)}_{\text{\# of clus.}} + \underbrace{R - I(U; X)}_{\text{\# of items in each clus.}},\ I(U; X, Y)\}$.
  • 29. Identification Rate, Search and Memory Complexity Trade-off: Search-Memory Complexity: Search Complexity Exponent [Farhadzadeh et al.(2013b)]. Diagram: the query $y^N$ is matched against the clusters $u^N(w_1)$, $w_1 = 1, \dots, M_1$, with members $x^N(w_1, 1), \dots, x^N(w_1, M_2)$. First decoder: $2^{N I(U;X,Y)}$ cluster checks to construct $W_1$. Second decoder: if $R > I(U; X)$, $2^{N[R + I(U;X|Y) - I(U;X)]}$ refinement checks; if $R < I(U; X)$, $2^{N I(U;X|Y)}$ refinement checks. $S_e = \max\{\underbrace{I(U; X, Y)}_{\text{\# of clus.}},\ \max\{\underbrace{R - I(U; X)}_{\text{\# of items in each clus.}} + \underbrace{I(U; X \mid Y)}_{\text{\# of clus. est. by first dec.}},\ I(U; X \mid Y)\}\}$.
  • 30. Identification Rate, Search and Memory Complexity Trade-off: Binary Source: Binary Source Example. Consider $Q_s(x) = 1/2$, $x \in \{0, 1\}$, a BSC with cross-over probability $q = 0.1$, $R = 0.5$ and $\mathcal{U} = \{0, 1\}$. [Figure: curves over the axes $S_e$ and $M_e$ for $U ↔ X ↔ Y$ and $X ↔ Y ↔ U$, with the point of minimum $S_e$ marked "how?".] Remark: the generalized scheme achieves a smaller search complexity. The minimum search-complexity exponent is larger than $R/2$.
  • 31. Identification Rate, Search and Memory Complexity Trade-off: Binary Source: Minimizing Search Complexity [Farhadzadeh et al.(2014)]. Theorem: let $\mathcal{U} = \{0, 1\}$. The minimum search-complexity exponent $S_e^* = (1 - q)(1 - H_2(p_1^*))$ can be achieved if $P(y \mid u)$ and $P(x \mid u)$ are BSCs with the same cross-over probability $P_b = p_1^* \ast q/2 = p_1^*(1 - q) + q/2$, where $p_1^* = \left[H_2^{-1}(1 - R/2) - q/2\right]/(1 - q)$. Conclusion: $S_e^*$ is achieved if $I(U; X) = I(U; Y)$ ⇒ $\underbrace{I(U; X \mid Y)}_{\text{\# of clus. est. by first dec.}} = \underbrace{I(U; Y \mid X)}_{\text{\# of clus. incl.}}$.
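The theorem on slide 31 can be evaluated numerically for the binary example ($q = 0.1$, $R = 0.5$). The sketch below inverts the binary entropy by bisection on $[0, 1/2]$ and computes $p_1^*$ and $S_e^*$. It reads the common cross-over probability as $P_b = p_1^*(1 - q) + q/2$, which is an assumption about the garbled slide text, although it agrees with $1 - H_2(P_b) = R/2$. Function names are illustrative.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def h2_inverse(value, tol=1e-12):
    """Inverse of h2 restricted to [0, 1/2], found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if h2(mid) < value:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def min_search_exponent(rate, q):
    """Return (S_e*, p_1*, P_b) for a uniform binary source and a BSC(q)."""
    p1 = (h2_inverse(1.0 - rate / 2.0) - q / 2.0) / (1.0 - q)
    pb = p1 * (1.0 - q) + q / 2.0  # assumed reading of the slide's P_b
    s_e = (1.0 - q) * (1.0 - h2(p1))
    return s_e, p1, pb


s_e_star, p1_star, pb_star = min_search_exponent(rate=0.5, q=0.1)
```

For these parameters `s_e_star` comes out above $R/2 = 0.25$, in line with the remark on slide 30.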
  • 32. Identification Rate, Search and Memory Complexity Trade-off: Numerical Evaluation. Fingerprint extraction: we use an image database of 20,828 gray-scale images from ImageNet; all images are resized to 384 × 512 pixels; a binary fingerprint of length 64 is extracted from each image, $R \approx 0.22$. Identification performance, memory and complexity analysis. Values per row: $P_E$ (%); clustering with k-medians, $M_1 = 180$ ($M_2 \approx$, $M_3$); clustering with BBMM, $M_1 = 220$ ($M_2 \approx$, $M_3$); search complexity with k-medians ($S_e$, usage %); search complexity with BBMM ($S_e$, usage %); $M_e$ (k-medians, BBMM).
      AWGN, PSNR 40 dB: 0.019; 115, 30; 473, 5; 0.19, 16.73; 0.17, 10.68; 0.22, 0.26
      AWGN, PSNR 30 dB: 0.024; 115, 70; 568, 6; 0.21, 38.85; 0.18, 14.95; 0.22, 0.26
      AWGN, PSNR 20 dB: 0.389; 115, 125; 946, 10; 0.22, 69.33; 0.2, 35.14; 0.22, 0.27
      JPEG, QF 75: 0.010; 115, 27; 378, 4; 0.18, 15.06; 0.16, 8.67; 0.22, 0.25
      JPEG, QF 50: 0.014; 115, 30; 568, 5; 0.19, 16.73; 0.17, 12.64; 0.22, 0.26
      JPEG, QF 25: 0.016; 115, 50; 662, 7; 0.20, 27.79; 0.19, 19.62; 0.22, 0.27
      Histeq: 3.140; 115, 140; 1236, 13; 0.22, 87.32; 0.21, 50.91; 0.22, 0.28
  • 33. Active Content Fingerprinting: Overview (outline as on slide 2).
  • 34. Active Content Fingerprinting: Conventional Passive Content Fingerprinting: Main Challenge: complexity and robust fingerprint. Binary fingerprint: the BSC between $\bar{X}^L$ and $\bar{Y}^L$ with cross-over probability $P_b$ leads to a bounded distance decoder (BDD) that searches a Hamming sphere of radius $\theta L$ around $\bar{y}^L$, with $\theta \propto P_b$. Relatively large $P_b$ ⇒ large Hamming sphere ⇒ high complexity; however, a robust fingerprint (small $P_b$) ⇒ small Hamming sphere ⇒ low complexity. One solution: active content fingerprinting, which modifies the content to make its fingerprint (FP) more robust [Voloshynovskiy et al.(2012)].
  • 35. Active Content Fingerprinting: Unidimensional Case of aCFP: Shrinkage-Based aCFP (SbaCFP) [Farhadzadeh et al.(2013a)]. To make the FP more robust: [Figure: original distribution $p(\tilde{x})$, with the region between $-\gamma$ and $+\gamma$ marked as less robust (bit flip); modulator function $\varphi_s(\tilde{x})$ with levels $\pm\gamma$; modulated distribution $p(\varphi_s(\tilde{x}))$ with probability mass $\frac{1}{2} - Q\left(\frac{\gamma}{\sigma_X}\right)$ at each of $-\gamma$ and $+\gamma$.] $P_b = \Pr\{\mathrm{sign}(\tilde{X}_i) \ne \mathrm{sign}(\tilde{Y}_i)\} = E\left[Q\left(\frac{|\varphi_s(\tilde{X})|}{\sigma_Z}\right)\right]$.
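  • 36. Active Content Fingerprinting: Unidimensional Case of aCFP: Shrinkage-Based aCFP [Farhadzadeh et al.(2013a)]. Implementation: $X^N$ is transformed by $W_F$ into $\tilde{X}^N$; a divider splits it into $\tilde{X}^L$, which passes through the modulator $\varphi_s(\cdot)$, and the complementary part $\tilde{X}_c^{N-L}$; a combiner and $W_F^{-1}$ produce $V^N$ (modified content); the fingerprint is $\bar{X}^L = \mathrm{sign}(\cdot)$ of the modulated coefficients; key $k$. $W_F \in \{\pm 1/\sqrt{N}\}^{N \times N}$, $W \subset W_F$. Distortion: $D_s = \frac{1}{N} E\left[\|X^N - V^N\|_2^2\right] = \frac{2L}{N}\int_0^{\gamma} (\gamma - t)^2 p_{\tilde{X}}(t)\,dt$.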
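The shrinkage modulator of slides 35 and 36 can be sketched as follows. The exact form of $\varphi_s$ is not spelled out in the transcript; the version below is an assumption inferred from the probability masses $\frac{1}{2} - Q(\gamma/\sigma_X)$ at $\pm\gamma$ and from the distortion integral: coefficients with magnitude below $\gamma$ are pushed to $\pm\gamma$ and the others are left unchanged. The Gaussian model for $\tilde{X}$, the midpoint-rule integration and all names are illustrative choices.

```python
import math


def q_function(t):
    """Gaussian tail probability Q(t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def shrink(x, gamma):
    """Assumed modulator: push |x| < gamma out to +/- gamma, keep larger values."""
    if abs(x) >= gamma:
        return x
    return gamma if x >= 0 else -gamma


def gaussian_pdf(t, sigma):
    return math.exp(-0.5 * (t / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, steps=20000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def bit_error_probability(gamma, sigma_x, sigma_z):
    """P_b = E[Q(|phi_s(X)| / sigma_z)] for X ~ N(0, sigma_x^2); gamma = 0 is passive."""
    return integrate(
        lambda x: q_function(abs(shrink(x, gamma)) / sigma_z) * gaussian_pdf(x, sigma_x),
        -8.0 * sigma_x, 8.0 * sigma_x)


def shrinkage_distortion(gamma, sigma_x, l_over_n):
    """D_s = 2 (L/N) * integral_0^gamma (gamma - t)^2 p(t) dt."""
    return 2.0 * l_over_n * integrate(
        lambda t: (gamma - t) ** 2 * gaussian_pdf(t, sigma_x), 0.0, gamma)
```

With `gamma = 0` the expectation reduces to the passive value $\frac{1}{\pi}\arctan(\sigma_Z/\sigma_X)$; raising `gamma` lowers $P_b$ at the price of a larger $D_s$, which is the trade-off the slides describe.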
  • 37. Active Content Fingerprinting: Unidimensional Case of aCFP: Analytical Comparison. [Figure: $P_b$ (logarithmic axis, $10^{-30}$ to $10^{0}$) against DNR (0 to 25) for pCFP, LB and SbaCFP. Caption: comparison of pCFP, SbaCFP and LB in DWM, DWR = 24 dB and $L/N = 0.01$.] Advantages: more robust; lower complexity. Disadvantages: quality degradation; still random structure: BDD.
  • 38. Active Content Fingerprinting: Multidimensional Case of aCFP: Lattice-Based aCFP (LbaCFP) [Farhadzadeh et al.(2013a)]. From random to structural fingerprint: $X^N \in \mathbb{R}^N$ is projected by $W$ to $\tilde{X}^L \in \mathbb{R}^L$. Remark: a lattice quantizer is used to modify contents to impose structure on $\tilde{X}^L$. Use the Leech lattice in $\mathbb{R}^{24}$ [Leech(1967)]: largest kissing number; largest packing density; very fast decoder: about 519 operations [Amrani and Beery(1996)].
  • 39. Active Content Fingerprinting: Multidimensional Case of aCFP: Lattice-Based aCFP [Farhadzadeh et al.(2013a)]. Implementation: $X^N$ is transformed by $W_F$ into $\tilde{X}^N$; a divider splits it into $\tilde{X}^L$, which passes through the lattice modulator $\varphi_\Lambda(\cdot)$, and the complementary part $\tilde{X}_c^{N-L}$; a combiner and $W_F^{-1}$ produce $V^N$ (modified content); fingerprint $\bar{x}^L$; key $k$. $W_F \in \{\pm 1/\sqrt{N}\}^{N \times N}$, $W \subset W_F$. Distortion: $D_\Lambda = \frac{1}{N} E\left[\|X^N - V^N\|_2^2\right] = \frac{L}{N} G(\Lambda, \mathcal{V}) V^{2/L}$, where $G(\Lambda, \mathcal{V})$ is the normalized second moment and $V$ the volume of the Voronoi region $\mathcal{V}$. Advantages: more robust; very low complexity. Disadvantages: quality degradation; non-uniform distribution.
  • 40. Active Content Fingerprinting: Numerical Evaluation. Fingerprint extraction: we employ a real image database of 1,338 gray-scale images from UCID [Schaefer and Stich(2004)]; a binary fingerprint is extracted from each modulated image. Different modulation schemes (modulator; parameters; PSDR; MSSIM):
      SbaCFP; $L = 192$, $\gamma = 60$; 53 dB; 0.999
      SbaCFP; $L = 32$, $\gamma = 110$; 53 dB; 0.999
      LbaCFP; scale = 70, $L = 24$; 53 dB; 0.999
  • 41. Active Content Fingerprinting: Numerical Evaluation: Performance Analysis: Unidimensional Case. Modulator: SbaCFP with $L = 32$, $\theta = 2/L$, $\gamma = 110$. Values per row: distortion in the image domain and in the projection domain; then $P_{ci}$, $P_{fa}$, $\hat{P}_b$, $P_b$, $P_b$ of pCFP, $P_{ci}$ of pCFP.
      AWGN, PSNR 20 dB, DNR 18 dB: 1, 0, 0, 0, 0.05, 0.80
      AWGN, PSNR 15 dB, DNR 13 dB: 0.998, 0, 0.004, 0.003, 0.08, 0.55
      AWGN, PSNR 10 dB, DNR 8 dB: 0.76, 0, 0.05, 0.05, 0.13, 0.21
      AWGN, PSNR 5 dB, DNR 3 dB: 0.15, 0, 0.15, 0.14, 0.21, 0.05
      JPEG, QF 25, DNR 27 dB: 1, 0, 0, 0, 0.01, 0.98
      JPEG, QF 10, DNR 20 dB: 1, 0, 0, 0, 0.04, 0.86
      JPEG, QF 1, DNR 10 dB: 0.92, 0, 0.03, 0.02, 0.11, 0.33
      Histeq, DNR 4 dB: 0.94, 0, 0.02, 0.12, 0.14, 0.43
      Remark: the $P_b$ values of active content fingerprinting improve markedly on those of pCFP, which leads to high $P_{ci}$.
  • 42. Active Content Fingerprinting: Numerical Evaluation: Performance Analysis: Multidimensional Case. Values per row: distortion in the image domain and in the projection domain; then $P_{ci}$ for SbaCFP ($L = 192$, $\theta = 2/L$) and $P_{ci}$ for LbaCFP ($L = 24$).
      AWGN, PSNR 20 dB, DNR 18 dB: 0.96, 1
      AWGN, PSNR 15 dB, DNR 13 dB: 0.07, 0.83
      AWGN, PSNR 10 dB, DNR 8 dB: 0, 0.01
      AWGN, PSNR 5 dB, DNR 3 dB: 0, 0
      JPEG, QF 25, DNR 27 dB: 1, 1
      JPEG, QF 10, DNR 19 dB: 0.998, 1
      JPEG, QF 1, DNR 10 dB: 0, 0.14
      Histeq, DNR 6 dB (LbaCFP 3 dB): 0.3, 0.11
      Remark: except for Histeq, LbaCFP outperforms the unidimensional modulation schemes.
  • 43. Conclusions and Future Work: Overview (outline as on slide 2).
  • 44. Conclusions and Future Work: Conclusions. We introduced an identification setup based on the constrained list-based decoder and analyzed its performance. We analyzed a simple digital fingerprinting approach based on random projections; random projections can not only reduce the data dimensionality but also eliminate correlation among data samples. We investigated a two-stage decoding scheme capable of achieving the identification capacity with a search complexity lower than that of exhaustive search. We presented active content fingerprinting, which takes the best of content fingerprinting and digital watermarking to overcome some of the fundamental restrictions of these techniques.
  • 45. Conclusions and Future Work: Future Work. Identification based on a unique representative → multiple representatives. First-order autoregressive process → higher-order autoregressive process. Two-stage decoding in identification → information retrieval. Identification rate, search and memory complexity trade-offs → including security and privacy leakage trade-offs.
  • 47. References
      Amrani, O., Beery, Y., 1996, Efficient bounded-distance decoding of the hexacode and associated decoders for the Leech lattice and the Golay code, IEEE Trans. on Com., 44, 534–537
      Farhadzadeh, F., Voloshynovskiy, S., Koval, O., 2010, Performance analysis of identification system based on order statistics list decoder, in Proc. of IEEE International Symposium on Information Theory (ISIT), Austin, TX
      Farhadzadeh, F., Voloshynovskiy, S., Koval, O., Beekhof, F., 2011, Information-theoretic analysis of content based identification for correlated data, in IEEE Information Theory Workshop (ITW), pp. 205–209, Paraty, Brazil
      Farhadzadeh, F., Voloshynovskiy, S., Holotyak, T., Beekhof, F., 2013a, Active content fingerprinting: Shrinkage and lattice based modulations, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada
      Farhadzadeh, F., Willems, F. M., Voloshynovskiy, S., 2013b, Fundamental limits of identification: Identification rate, search and memory complexity trade-off, in IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey
      Farhadzadeh, F., Sun, K., Ferdowsi, S., 2014, Efficient two stage decoding scheme to achieve content identification capacity, submitted
      Jégou, H., Douze, M., Schmid, C., 2011, Product quantization for nearest neighbor search, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33, 117–128
      Leech, J., 1967, Notes on sphere packings, Canadian Journal of Mathematics
      Schaefer, G., Stich, M., 2004, UCID - an uncompressed colour image database, in Storage and Retrieval Methods and Applications for Multimedia, Proc. of SPIE
      Voloshynovskiy, S., Farhadzadeh, F., Koval, O., Holotyak, T., 2012, Active content fingerprinting: a marriage of digital watermarking and content fingerprinting, in IEEE WIFS, Tenerife, Spain
      Willems, F., 2009, Searching methods for biometric identification systems: Fundamental limits, in IEEE International Symposium on Information Theory (ISIT), pp. 2241–2245
      Willems, F., Kalker, T., Goseling, J., Linnartz, J., 2003, On the capacity of a biometrical identification system, in Proc. of IEEE International Symposium on Information Theory, p. 82