Talk Benjamin NGUYEN

PRiSM Lab. - UMR 8144
Privacy Preserving SQL Query Execution on Distributed Data
Quoc-Cuong To, Benjamin Nguyen, Philippe Pucheral
SMIS Project
LaHDAK Seminar, Orsay, 4 March 2014
Université de Versailles et St-Quentin
INRIA Rocquencourt
CNRS

PART I
The New Oil
I. The New Oil
II. Trusted Cells
III. Global SQL Queries
IV. Cost Model and Experiments
V. Conclusion

Mass-generation of (personal) data
Data sources have mostly turned digital
• Analog processes, e.g., photography, films
• Paper-based interactions, e.g., banking, e-administration
• Communications, e.g., email, SMS, MMS, Skype
Where is your personal data? … In data centers
• 112 new emails per day → Mail servers
• 65 SMS sent per day → Telcos
• 800 pages of social data → Social networks
• Web searches, list of purchases → Google, Amazon
People recording
People listening
St Peter's Square, Rome
WHY? Is this a problem?
Everything is free…
Information Extraction

Personal data is the new oil
Is this good news?
• $2 billion a year spent by US companies on third-party data about individuals (Forrester Report)
• $44.25 is the estimated return on $1 invested in email marketing (oil is up to $0.5/yr)
High Market Value Companies
• Facebook: value / #accounts ≈ $50
• Google: $38 billion business sells ads based on how people search the Web
• Amazon (knows purchase intent), mail order systems companies (gmail), loyalty programs (supermarkets), banks & insurance, employment market (LinkedIn, Viadeo), travel & transportation (voyages-sncf), the « love » market (Meetic), etc.

Personal data is the new oil … or bad news?
How would oil companies behave?
• Exploit your oil field for free → Know all about you
• Offer “extra” services → Refine their knowledge
• Provide real services to their paying customers
(e.g. advertisement and profiling, location tracking and spying, …)
In other words: your personal data would be processed by sophisticated data refineries…
REGARDLESS OF YOUR PRIVACY!
It’s the business model…
Their choice
Your choice

Is the current centralised model good wrt privacy protection?
Intrinsic problem #1: personal data is exposed to sophisticated attacks
– High benefits to successful hack
– One person's negligence may affect millions
Intrinsic problem #2: personal data is hostage to sudden privacy changes
– Centralised administration of data means delegation of control
– This leads to regular changes, with application (and business) evolution, with mergers and acquisitions, etc. (e.g., Facebook 2012)
Increasing security is only a partial solution since it does not solve those intrinsic limitations
E.g., TrustedDB [BS12] proposes tamper-resistant hardware to secure outsourced centralized databases.

A New Hope
A Personal Data Ecosystem…
… built around user-centricity and trust,
achieved through a decentralized architecture
THE TRUSTED CELL!
I want my privacy back!!
Our goals:
• Preserve current USER functionalities
• Hinder uncontrolled data exploitation & privacy violations
Our targets:
• General Data Management Applications: SQL
• “Low cost” solutions (i.e. acceptable by general public)

PART II
Trusted Cells
I. The New Oil
II. Trusted Cells
III. Global SQL Queries
IV. Cost Model and Experiments
V. Conclusion

The Secure (Trusted) Personal Data Server Approach [AAB+10]
Personal database is
• Well-organized
• Tamper resistant
• Controlled by the owner (sharing, retention, audit)
• Accessible in disconnected mode
[Figure: Bob's personal data. Sources: hospital, doctor's office, my bank, my employer, my telco. Uses: private application, secure multi-actor application, secure global queries, external application. Examples shown: epidemiological study, medical care (doctor, nurse), budget optimization, financial help.]
Approach characteristics:
• Based on tamper-resistant HW
• Well Structured World (R-DB, limited apps)
• Uniform equipment
[Figure: TRUSTED CELL. Labels: CPU, Crypto, RAM, NOR FLASH, FLASH (GB size), Operating System, Relational DBMS, PDS generic code, Certified, Encrypted Personal Database.]

Why trust personal secure HW solutions?
1. Users store their own data
→ minimize abusive usage
2. Auto-administered platform
→ no DBA attack (even by user)
3. Enforce privacy principles for externalized (shared) data
→ best if the recipient of the data is another TC
4. Tamper-resistance + certified code/secure execution + single user + physical access needed
→ ratio cost/benefit of an attack is very high
Tamper resistance (figure labels): Gemalto secure token, SMIS token (ZED), Trust Zone architecture, dedicated HW device, PC? (social trust / open source)

The Trusted Cell Asymmetric Architecture
Built using Secure Portable Tokens as Trusted Cells (called here Trusted Data Server or TDS) / Cloud as Supporting Server Infrastructure (SSI).
[Figure: TC asymmetric architecture. One side is HIGH POWER / AVAILABILITY with LOW / NO TRUST, the other is LOW POWER / AVAILABILITY with HIGH TRUST. Other labels: Durability, Availability; Secure Computation; Export Data; Encrypted; Private Data Generated (e.g. sensor).]
Challenges:
• Local (Embedded) data management (not my work: Anciaux, Bouganim, Pucheral et al.)
• Global querying (Part III)
• Data export management (MinExp Project with CG78 & LIX)

PART III
Global SQL Queries on the Asymmetric Architecture
I. The New Oil
II. Trusted Cells
III. Global SQL Queries
IV. Cost Model and Experiments
V. Conclusion

Example Trusted Cell: a Trusted Data Server (TDS)
Token Characteristics:
• High security:
– High ratio Cost/Benefit of an attack;
– Secure against its owner;
• Modest computing resources (~10Kb of RAM, 50MHz CPU);
• Low availability: physically controlled by its owner; connects and disconnects at will
How to compute global queries over decentralized personal data stores while respecting users’ privacy?
[Figure: an Authorized Querier asks for the average salary in Orsay; an Unauthorized Querier is also shown.]

Secure Global Computation on TCs
PROBLEM:
How to perform global queries on the asymmetric architecture? (i.e. using data from many/all cells)
The « classical » problem of Secure Global Computation (e.g. SMC) is more general and makes no trust assumption.
THREAT MODEL:
TC can be:
• Unbreakable (honest)
• Broken (Weakly Malicious)
Infrastructure (SSI) can be:
• Honest but curious (Semi-honest)
• Weakly-Malicious (Covert Adversary = does not want to be detected)
HBC + Unbreakable → “simple protocols” presented here (EDBT’14 [TNP14])
WM + Broken → Must be prevented! (via security primitives) see [ANP13]

Is this a new problem?
Several approaches are possible to securely perform global computations:
1. Use only an untrusted server/cloud/P2P and use generic (and costly) algorithms (e.g. Secure Multi-Party Computing [Yao82, GMW87, CKL06], fully homomorphic encryption [Gent09]). Problem = COST
2. Use only an untrusted server/cloud/P2P and develop a specific algorithm for each specific class of queries or applications (e.g. DataMining Toolkit [CKV+02]). Problem = GENERICITY
3. Introduce a tangible element of trust, through the use of a trusted component, and develop a generic methodology to execute any centralized algorithm in this context ([Katz07, GIS+10, AAB+10]). Problem = TRUST

Hypothesis on Querier and SSI
Querier:
• Shares the secret key with TDSs (to encrypt the query & decrypt the result).
• Classical Access control policy (e.g. RBAC):
– Cannot get the raw data stored in TDSs (gets only the final result)
– Can obtain only authorized views of the dataset (→ we do not consider inferential attacks)
Supporting Server Infrastructure:
• Doesn’t know the query (and so the attributes in the GROUP BY clause) because the query is encrypted by the Querier before being sent to the SSI.
• Has prior knowledge about data distribution.
• Honest-but-curious attacker: Frequency-based attack
– SSI matches the plaintext and ciphertext of the same frequency, e.g. investigates remarkable (very high/low) frequencies in the dataset distribution (e.g., X is the only person with a given (high) age and still working and earning money → if I find a group with only one member I can deduce that X participates in the dataset).
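The frequency-based attack can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the slides: deterministic encryption is modeled by a keyed HMAC (hypothetical `det_enc`), the attacker's prior knowledge is an exact histogram of ages, and the ages themselves are made up.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"shared by querier and TDSs"  # hypothetical key, unknown to the SSI


def det_enc(value: str) -> str:
    """Stand-in for deterministic encryption: equal plaintexts give equal tokens."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:8]


def frequency_attack(ciphertexts, prior):
    """Guess the plaintext of every token whose frequency is unique in the prior."""
    observed = Counter(ciphertexts)
    prior_by_count = Counter(prior.values())  # how many plaintexts share a frequency
    guesses = {}
    for token, count in observed.items():
        if prior_by_count[count] == 1:  # exactly one plaintext has this frequency
            guesses[token] = next(p for p, c in prior.items() if c == count)
    return guesses


# Made-up ages; 82 is the "remarkable" value carried by a single person X.
ages = ["25"] * 3 + ["45"] * 3 + ["82"]
prior = {"25": 3, "45": 3, "82": 1}   # attacker's background knowledge
sent = [det_enc(a) for a in ages]     # what the SSI observes
guesses = frequency_attack(sent, prior)
```

In this toy run only the singleton group is identified: ages 25 and 45 share the same frequency, so the SSI cannot tell them apart, whereas the group of size one reveals that X participates.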

Solution Overview
Query template:
SELECT <attribute(s) and/or aggregate function(s)>
FROM <Table(s) / SPTs>
[WHERE <condition(s)>]
[GROUP BY <grouping attribute(s)>]
[HAVING <grouping condition(s)>]
[SIZE <size condition(s)>];
Example query:
SELECT age, AVG(salary)
FROM user
WHERE town = 'Orsay'
GROUP BY age
HAVING MIN(salary) > 0
SIZE
Phases (through the Supporting Server Infrastructure, SSI):
1) Query
2) Collection and Filtering phase
Stop condition: max #tuples or max time
Example data held by TDSs: John, 35K; Mary, 43K; Paul, 100K
3) Aggregation phase
4) Aggregate Filtering phase

Proposed Solutions
The main difficulty is with AGGREGATE QUERIES!!
Solutions vary depending on which kind of encryption is used, how the SSI constructs the partitions, and what information is revealed to the SSI.
• Secure aggregation solution
• Noise-based solutions
– random (white) noise
– noise controlled by the complementary domain
• Histogram-based solutions
We investigate these solutions along the directions of performance and security.

Secure Aggregation
[Figure: protocol walk-through between the Querier, the Supporting Server Infrastructure (SSI) and the TDSs. The recoverable steps and sample values are listed below; the SSI only ever sees ciphertext pairs such as (#x3Z, aW4r).]
1) The Querier sends Qi = <EK1(Q), Cred, Size>, with
Q: SELECT Age, AVG(Salary) WHERE city = Orsay GROUP BY Age HAVING Min(Salary) > 0
2) Each TDS decrypts Qi, checks the AC rules, and encrypts its data using non-deterministic encryption. Example TDS data: (25y, Orsay, 35K), (45y, Orsay, 43K), (53y, Paris, 100K). The figure also carries the annotation “No answer?”.
3) The SSI forms partitions (each fits the resources of a TDS).
4) TDSs hold partial aggregations (Gij, AGGk), e.g. (25, 35K), (45, 43K), (45, 37K) give (25, [35K,1]) and (45, [40K,2]).
5) Final aggregation, e.g. (25, 29.5K), (45, 43.7K), …
6) Evaluate HAVING clause; the final result is returned encrypted.
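The iterative partial aggregation can be sketched as follows. This is a plaintext simulation under stated assumptions, not the protocol itself: encryption is omitted (in the protocol the SSI sees only non-deterministically encrypted items), partial aggregates are carried as (group, sum, count, min) rather than the [average, count] pairs of the figure, TDSs that fail the WHERE clause simply send nothing, and all function names and sample values are hypothetical.

```python
def tds_collect(record):
    """Collection phase, run by one TDS: apply WHERE city = 'Orsay' locally."""
    age, city, salary = record
    if city != "Orsay":
        return []                      # simplification: non-matching TDSs send nothing
    return [(age, salary, 1, salary)]  # (group, sum, count, min)


def tds_merge(partition):
    """Aggregation phase, run by one TDS: merge items that share a group."""
    acc = {}
    for group, total, count, low in partition:
        t, c, m = acc.get(group, (0, 0, low))
        acc[group] = (t + total, c + count, min(m, low))
    return [(g, t, c, m) for g, (t, c, m) in acc.items()]


def secure_aggregation(records, partition_size):
    items = [item for r in records for item in tds_collect(r)]
    while len(items) > partition_size:
        # The SSI only cuts the list into chunks; it never inspects the items.
        chunks = [items[i:i + partition_size]
                  for i in range(0, len(items), partition_size)]
        merged = [item for chunk in chunks for item in tds_merge(chunk)]
        if len(merged) == len(items):
            raise ValueError("partition_size must exceed the number of groups")
        items = merged
    final = tds_merge(items)
    # A last TDS evaluates HAVING MIN(salary) > 0 and computes AVG(salary).
    return {g: t / c for g, t, c, m in final if m > 0}


records = [(25, "Orsay", 35), (45, "Orsay", 43), (53, "Paris", 100),
           (45, "Orsay", 37), (25, "Orsay", 24)]
result = secure_aggregation(records, partition_size=3)
```

The guard in the loop reflects a constraint of this approach: a TDS must be able to hold one entry per group, otherwise merging a partition makes no progress.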

Noise Based Protocols
Secure Aggregation efficiency problem:
nDet_Enc on AG (the grouping attributes) → SSI cannot gather tuples belonging to the same group into the same partition.
But:
Det_Enc on AG → frequency-based attack.
Idea:
Add noise (fake tuples) to hide the distribution of AG.
How many fake tuples (nf) are needed? → depends on the disparity in frequencies among AG
– small nf: random noise
– big nf: white noise
– nf = n-1: controlled noise (n: AG domain cardinality)
Efficiency:
– Each TDS handles tuples belonging to one group (instead of a large partial aggregation as in SAgg)
– However, high cost of generating and processing the very large number of fake tuples
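A minimal sketch of how fake tuples flatten the distribution seen by the SSI. The assumptions are mine, not the slides': the AG domain is a small made-up list of ages, a boolean flag marks fake tuples (in the protocol that flag would sit inside the encrypted part of the tuple), and the group value stands in for its deterministic ciphertext.

```python
import random
from collections import Counter

AG_DOMAIN = [25, 35, 45, 55]   # hypothetical domain of the grouping attribute (n = 4)


def controlled_noise(true_age, salary):
    """One TDS: emit its true tuple plus n-1 fakes, one per other domain value."""
    out = [(true_age, salary, False)]
    out += [(age, 0, True) for age in AG_DOMAIN if age != true_age]
    random.shuffle(out)            # hide which tuple is the true one
    return out


def random_noise(true_age, salary, nf):
    """One TDS: emit its true tuple plus nf fakes with randomly drawn groups."""
    out = [(true_age, salary, False)]
    out += [(random.choice(AG_DOMAIN), 0, True) for _ in range(nf)]
    random.shuffle(out)
    return out


holders = [(25, 35), (45, 43), (45, 37), (45, 50)]   # skewed true distribution
sent = [t for age, sal in holders for t in controlled_noise(age, sal)]
seen_by_ssi = Counter(age for age, _, _ in sent)      # stands for Det_Enc(age) counts
true_only = [(age, sal) for age, sal, is_fake in sent if not is_fake]
```

With controlled noise every domain value appears exactly once per TDS, so the SSI sees a perfectly uniform distribution, at the price of n tuples per participant instead of one. A TDS processing a group later discards the tuples flagged as fake.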

Nearly Equi-Depth Histogram Solution
1. Distribution of AG is discovered and distributed to all TDSs.
2. Each TDS allocates its tuple to the corresponding bucket.
3. Each TDS sends to the SSI: {h(bucketId), nDet_Enc(tuple)}
[Figure: true distribution versus nearly equi-depth histogram.]
Consequences:
→ We do not generate & process too many fake tuples
→ We do not handle too large partial aggregations
Problem: Distribution must be discovered
→ This can be done “offline” using secure aggregation!
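One possible way to build such buckets is sketched below. The slides do not say how the histogram is constructed, so the greedy rule (heaviest value into the currently lightest bucket), the sample distribution and the function names are all assumptions; `h` uses SHA-256 only as a stand-in for the hash in {h(bucketId), nDet_Enc(tuple)}.

```python
import hashlib
import heapq


def build_buckets(distribution, num_buckets):
    """Greedy nearly equi-depth histogram: heaviest value goes to the lightest bucket."""
    heap = [(0, b) for b in range(num_buckets)]   # (depth, bucket id), already a valid heap
    assignment = {}
    for value, count in sorted(distribution.items(), key=lambda kv: -kv[1]):
        depth, bucket = heapq.heappop(heap)
        assignment[value] = bucket
        heapq.heappush(heap, (depth + count, bucket))
    return assignment


def h(bucket_id):
    """Hash of the bucket id, the only grouping information shown to the SSI."""
    return hashlib.sha256(str(bucket_id).encode()).hexdigest()[:8]


distribution = {25: 50, 35: 10, 45: 30, 55: 10}   # hypothetical AG distribution
assignment = build_buckets(distribution, num_buckets=2)

# Depth of each bucket = number of tuples the SSI will see under that bucket hash.
depths = {}
for value, bucket in assignment.items():
    depths[bucket] = depths.get(bucket, 0) + distribution[value]
```

In this example the frequent value 25 fills one bucket alone while 35, 45 and 55 share the other, so both bucket hashes appear equally often and the SSI learns nothing from their frequencies.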

Information Exposure Analysis (DCJP+03)
To measure Information Exposure, we consider the probability that an attacker (here the Honest-but-Curious SSI) can reconstruct the plaintext table (or part of the table) using the encrypted table and his prior knowledge about global distributions of plaintext attributes.
Information Exposure is noted ε:
$$\varepsilon = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{k} IC_{i,j}$$
• n is the number of tuples
• k is the number of attributes
• IC_{i,j} is the value in row i and column j of the inverse cardinality table (= 1 / number of plaintext values that could correspond)
• N_j is the number of distinct plaintext values in the global distribution of the attribute in column j (i.e., N_j ≤ n).

Information Exposure Analysis (Damiani et al. CCS 2003)
• n: the number of tuples,
• k: the number of attributes,
• IC_{i,j}: IC for row i and column j
• N_j: the number of distinct plaintext values in the global distribution of the attribute in column j (i.e., N_j ≤ n)
SAgg: IC_{i,j} = 1/N_j for all i, j
$$\varepsilon_{S\_Agg} = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{k} \frac{1}{N_j} = \prod_{j=1}^{k} \frac{1}{N_j}$$
EDHist: requires finding all possible partitions of the plaintext values such that the sum of their occurrences is the cardinality of the hashed value: NP-Hard multiple subset sum problem
$$\min(\varepsilon_{ED\_Hist}) = \prod_{j=1}^{k} \frac{1}{N_j}$$
Noise_based & ED_Hist have a uniform distribution of the AG: ε_ED_Hist = ε_Noise_based
Plaintext:
$$\varepsilon_{P\_Text} = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{k} 1 = 1$$
Ordering: ε_S_Agg ≤ ε_ED_Hist = ε_Noise_based < 1
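The exposure coefficient is easy to evaluate numerically. The sketch below computes ε = (1/n) Σ_i Π_j IC_{i,j} for the two extreme cases named on the slide; the table size and the values of N_j are made up for illustration.

```python
from math import prod


def exposure(ic):
    """Average, over the n rows, of the product of the k inverse cardinalities."""
    n = len(ic)
    return sum(prod(row) for row in ic) / n


n, N = 6, [4, 5]   # hypothetical: 6 tuples, two attributes with 4 and 5 distinct values

# S_Agg: every cell could be any of the N_j plaintext values, so IC = 1/N_j.
eps_s_agg = exposure([[1 / Nj for Nj in N] for _ in range(n)])

# Plaintext: every cell is known exactly, so IC = 1.
eps_plain = exposure([[1.0 for _ in N] for _ in range(n)])
```

Here ε for S_Agg is 1/(4·5) = 0.05 while plaintext storage gives 1, which matches the ordering ε_S_Agg ≤ … < 1 stated above.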

PART IV
Cost Model and Experiments
I. The New Oil
II. Trusted Cells
III. Global SQL Queries
IV. Cost Model and Experiments
V. Conclusion

Unit Test Calibration
[Figure: internal time consumption]
Eval Board
• 32 bit RISC CPU: 120 MHz
• Crypto-coprocessor: AES, SHA
• 64KB RAM, 1GB NAND-Flash
• USB full speed: 12 Mbps
SMIS developed token (ZED electronics)
• Same technical characteristics
• Price = 50 EUR (small series)

Parameters for cost model
Dataset size Ttuple: varies from 5 to 65 million
Number of groups G: varies from 1 to 10^6
Number of TDSs participating in the computation as a percentage of all TDSs connected at a given time Ttds: varies from 1% to 100%.
We fix two parameters and vary the other, measuring: execution time, parallelism of the protocol, total load, maximum load on one TDS
When the parameters are fixed:
Ttuple = 10^6, G = 10^3, % of TDS connected = 10% of Ttuple.
We also compute and use the optimal value for all reduction factors as
well as for.
In the figures, we plot two curves for Rnf_Noise protocols RN (nf = 2) and
WN (nf = 1000) to capture the impact of the ratio of fake tuples.

EXECUTION TIME
[Figure panels: Ttuple = 10^6, G = 1 to 10^6; Ttuple = 5·10^6 to 35·10^6, G = 1000]
Varying G:
Naïve, noise-based, ED&EW:
• G increases, Ttuple fixed → number of tuples in each group decreases
• Depend only on the total number of tuples in each group (because all groups are processed in parallel) → exeTime decreases when G increases.
Secure Count:
• G increases → time for processing the big partial aggregation increases accordingly.
• Cannot fully exploit parallel computation (cannot divide each group among TDSs in parallel; each TDS has to handle all G groups) → exeTime increases
Varying Ttuple:
Naïve, RN, ED&EW:
• Ttuple increases, Ttds increases accordingly → not much change
Secure Count:
• Number of recursive steps increases when Ttuple increases → exeTime increases
WN, CN:
• Number of fake tuples increases linearly with the number of true tuples → exeTime also increases linearly to handle the fake & true tuples

NUMBER OF PARTICIPATING TDSS
[Figure panels: Ttuple = 10^6, G = 1 to 10^6; Ttuple = 5·10^6 to 35·10^6, G = 1000]
Varying G:
Secure Count:
• G increases → level of convergence is low & the size of each aggregation is big → fewer participating TDSs are needed to build the aggregations and reach a high convergence level
Other solutions:
• Since each group is processed in parallel and independently → when G increases, the level of parallelism increases → more TDSs are needed to participate in the parallel computation
Varying Ttuple:
WN, CN:
• When true Ttuple increases, the number of fake tuples increases as well → more TDSs are needed to process fake tuples
Secure Count:
• Level of parallelism is lower than in other solutions → needs the fewest TDSs

TOTAL LOAD (NETWORK OVERHEAD)
[Figure panels: Ttuple = 10^6, G = 1 to 10^6; Ttuple = 5·10^6 to 35·10^6, G = 1000]
Varying G:
Noise-based:
• Highest load because of the fake tuples
• When G increases but Tpds does not change → number of tuples (both true and fake) does not change → total load is the same
Others:
• Lower load since they handle only true data
Varying Ttuple:
Noise-based:
• When true Ttuple increases, the number of fake tuples increases linearly → total load is highest and increases

MAXIMUM LOAD
[Figure panels: Ttuple = 10^6, G = 1 to 10^6; Ttuple = 5·10^6 to 35·10^6, G = 1000]
Varying G:
Secure Count:
• When G increases, size of each aggregation is big → each PDS processes a bigger aggregation
• When G increases, number of participating PDSs decreases → each participating PDS incurs higher load
Others:
• When G increases, number of participating PDSs increases & number of tuples in each group decreases → each PDS processes fewer tuples → maxLoad decreases
Varying Ttuple:
WN, CN:
• Use all available PDSs → maxLoad increases linearly when Ttuple increases
Others:
• When Ttuple increases, the number of participating PDSs also increases accordingly → in general, the maxLoad does not increase too much

AVERAGE LOAD
[Figure panels: Ttuple = 10^6, G = 1 to 10^6; Ttuple = 5·10^6 to 35·10^6, G = 1000]
Secure Count:
• Total load is unchanged but the number of participating TDSs is reduced when G increases → the average load increases.
WN, CN:
• High total load is the same & all PTpds = 10^5 participate in the computation → every PDS incurs the same amount of load
Others:
• G increases, more participating PDSs & total load unchanged → AvgLoad decreases
Although TotalLoad(CN) > TotalLoad(SC), PTpds(CN) >> PTpds(SC) → AvgLoad(CN) < AvgLoad(SC)

CONSUMED MEMORY
[Figure: consumed memory compared with the actual RAM size of a TDS]
Noise-based:
• Need to store only 1 group regardless of G → requires the least RAM.
Histogram-based:
• Each PDS stores h groups (h>1) regardless of G → requires more RAM
SC:
• Each PDS stores all G groups
• When G increases, RAM needed increases → requires the most RAM
• Exceeds the actual RAM size → future work

AVERAGE TIME FOR PDS TO CONNECT
[Figure panels: Ttuple = 10^6, G = 1 to 10^6; Ttuple = 5·10^6 to 35·10^6, G = 1000]
Secure Count:
• The number of participating PDSs is reduced when G increases → the average time increases.
WN, CN:
• High total load is unchanged & all PTpds = 10^5 participate in the computation → every PDS takes the same amount of time to process data
Others:
• G increases, more participating PDSs → AvgTime decreases
High AvgTime:
• WN, CN: because of too many fake tuples
• SC: because of very few participating PDSs

Theoretical Scalability
[Figure panels: Tpds = 1% Ttuple; Tpds = 10% Ttuple; Tpds = 100% Ttuple]
Secure Count: has a (low) maximum number of participants.
Others: WN has higher scalability than the others (in the sense that adding participants counts)

Experimental Scalability

COMPARISON WITH OTHER STATE-OF-THE-ART METHODS
Answering aggregation queries in a secure system model (Ge & Zdonik, VLDB 2007)
Hardware:
• Linux workstation;
• AMD Athlon-64 2 GHz processor;
• 512 MB memory
Results:
• SC: depends mostly on G (slightly on Ttuple)
• Others: do not depend on G, but mostly on Ttuple
DES: each value is decrypted and the computation is performed on the plaintext. Server must have access to secret key & plaintext (violates security requirements)
Paillier: perform computation directly on the ciphertext using a secure homomorphic encryption scheme: enc(a + b) = enc(a) + enc(b), where the “+” on ciphertexts stands for the scheme's homomorphic operation (for Paillier, a multiplication modulo n²). Server performs computation without having access to the secret key or plaintext. In the end, ciphertexts are passed back to the trusted agent (i.e., Key Holder) to perform a final decryption and simple calculation of the final result
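The additive homomorphism of Paillier can be demonstrated with a toy implementation. This is a sketch for illustration only: the primes are tiny hypothetical values and offer no security, the generator is fixed to g = n + 1 (a common simplification), and the code is not taken from the compared system.

```python
import math
import random

# Toy parameters: tiny primes, for illustration only (hypothetical, not secure).
p, q = 1009, 1013
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                                 # valid because g = n + 1


def enc(m):
    """Encrypt m < n with fresh randomness r, so encryption is non-deterministic."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def dec(c):
    """Decrypt with the secret (lam, mu), held by the key holder only."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n


def add(c1, c2):
    """Server-side addition of plaintexts: multiply ciphertexts modulo n^2."""
    return (c1 * c2) % n2


total = dec(add(enc(35), enc(43)))   # the server never sees 35, 43 or their sum
```

The server can accumulate a SUM this way, but it still needs the key holder for the final decryption, and an AVG additionally requires a count and a division on the trusted side.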

Metrics for the evaluation of the proposed solutions
• Total Load
• Average Time/Load
• Query Response Time
• Information Exposure
• Throughput
• Resource Variation

Trade-off between criteria
Select .. From .. Where .. Group By AG
G = card(AG)
Security:
• S_Agg > ED_Hist
Performance:
• G > 10: ED_Hist > S_Agg
• G <= 10: ED_Hist < S_Agg

PART V
Conclusion and perspectives
I. The New Oil
II. Trusted Cells
III. Global SQL Queries
IV. Cost Model and Experiments
V. Conclusion

Short/Middle term research:
Data intensive Computing on an Asymmetric Architecture
SQL
• Queries here do not have joins!
• Take into account Malicious SSI / Broken Tokens
• Field experiment on usability (with ISN)
Private/Secure MapReduce
• Investigate compatibility of our protocols.
• Develop new protocols.
• Check performance!
XML management
• Adapt the work on XQ2P (Butnaru, Gardarin, Nguyen) to the Trusted Cells context.
• Distributed Window Queries.

Promoting the Trusted Cells vision
Trusted Cells “Core”
• Open hardware and software bundle: basic functionalities
– Local DB
– Distributed DB
– NoSQL DB
→ needed to develop PbD personal data management applications!
• Promote an open source community around Trusted Cells.
UVSQ FabLab
• Bring secure data management to the Versailles FabLab
Beyond Tamper Resistant HW
• Results are usable even with lower trust elements.
• Include social trust / reputation.

QUESTIONS?
