2. Leakage Problem
Stanford Infolab 2
[Figure: applications U1 and U2; profiles of Jeremy, Sarah, Mark and Kathryn (fields such as Name: Mark, Sex: Male; Name: Sarah, Sex: Female); other sources, e.g. Sarah’s network]
5. Problem Entities
Entity: Dataset
• Distributor (Facebook): T
– T: Set of all Facebook profiles
• Agents (Facebook Apps U1, …, Un): R1, …, Rn
– Ri: Set of profiles of people who have added the application Ui
• Leaker: S
– S: Set of leaked profiles
6. Agents’ Data Requests
• Sample
– 100 profiles of Stanford people
• Explicit
– All people who added application
(example we used so far)
– All Stanford profiles
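The two request types can be illustrated with a minimal sketch. The function names, the `(name, school)` profile layout and the toy data are assumptions made for illustration only; they do not come from the slides.

```python
import random

def explicit_request(profiles, condition):
    """Explicit request: return every profile that satisfies the condition."""
    return [p for p in profiles if condition(p)]

def sample_request(profiles, condition, m, rng=random):
    """Sample request: return any m profiles among those satisfying the condition."""
    candidates = explicit_request(profiles, condition)
    return rng.sample(candidates, m)

# Hypothetical toy data: (name, school) pairs stand in for Facebook profiles.
T = [("Kathryn", "Stanford"), ("Jeremy", "Stanford"),
     ("Sarah", "Stanford"), ("Mark", "Other")]

def is_stanford(profile):
    return profile[1] == "Stanford"

# Explicit: all Stanford profiles. Sample: any 2 Stanford profiles.
all_stanford = explicit_request(T, is_stanford)
two_stanford = sample_request(T, is_stanford, 2, random.Random(0))
```

With a sample request the distributor chooses which profiles to hand out; with an explicit request the result is fully determined by the condition.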
8. Guilt Models (1/3)
[Figure: other sources, e.g. Sarah’s network, labeled p]
p: posterior probability that a leaked profile
comes from other sources
Guilty Agent: Agent who leaks at least one profile
Pr{Gi|S}: probability that agent Ui is guilty, given
the leaked set of profiles S
9. Guilt Models (2/3)
• Agents leak each of their data items independently
or
• Agents leak all their data items or nothing
[Figure: leak scenarios with probabilities $(1-p)^2$, $(1-p)p$, $p(1-p)$ and $p^2$]
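One way to turn the independent-leak model into a number is sketched below. It relies on assumptions that are not stated on these slides: each leaked profile came from other sources with probability p, and otherwise was leaked by exactly one of the agents holding it, each being equally likely. Under those assumptions, agent Ui is innocent only if it leaked none of the profiles in S that it holds. The function name and the example allocation are hypothetical.

```python
def guilt_probability(agent, leaked, allocations, p):
    """Estimate Pr{G_i | S} under the assumptions stated in the lead-in.

    agent:       key of the agent in `allocations`
    leaked:      set S of leaked profiles
    allocations: dict mapping agent name -> set R_i of profiles it received
    p:           probability that a leaked profile comes from other sources
    """
    prob_innocent = 1.0
    for profile in leaked & allocations[agent]:
        # Number of agents that hold this profile (at least 1: this agent).
        holders = sum(1 for r in allocations.values() if profile in r)
        # Probability that this agent did NOT leak this particular profile.
        prob_innocent *= 1.0 - (1.0 - p) / holders
    return 1.0 - prob_innocent

# Hypothetical allocation: U1 holds Jeremy and Sarah, U2 holds Sarah and Mark.
allocations = {"U1": {"Jeremy", "Sarah"}, "U2": {"Sarah", "Mark"}}
leaked = {"Jeremy", "Sarah"}

g1 = guilt_probability("U1", leaked, allocations, p=0.2)
g2 = guilt_probability("U2", leaked, allocations, p=0.2)
```

In this toy case U1 holds both leaked profiles, one of them exclusively, so its estimated guilt is higher than that of U2, which only shares Sarah's profile.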
12. The Distributor’s Objective (1/2)
[Figure: agents U1, U2, U3, U4 with distributed sets R1, R2, R3, R4 and the leaked set S]
Pr{G1|S} >> Pr{G2|S}
Pr{G1|S} >> Pr{G4|S}
13. The Distributor’s Objective (2/2)
• To achieve his objective the distributor has to distribute sets R1, …, Rn that minimize
$$\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|, \quad i, j = 1, \dots, n$$
• Intuition: Minimal data sharing among agents makes leaked data reveal the guilty agents
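The objective (for each agent, the total overlap of its set with every other agent's set, divided by the size of its own set, summed over all agents) can be evaluated for a candidate distribution with a short sketch. The function name and the example sets are illustrative assumptions.

```python
def sum_objective(sets):
    """Sum over i of (1/|R_i|) * sum over j != i of |R_i ∩ R_j|.

    `sets` is a list of non-empty Python sets, one per agent.
    """
    total = 0.0
    for i, r_i in enumerate(sets):
        overlap = sum(len(r_i & r_j) for j, r_j in enumerate(sets) if j != i)
        total += overlap / len(r_i)
    return total

# Two agents with disjoint sets share nothing.
disjoint = [{"Kathryn", "Jeremy"}, {"Sarah", "Mark"}]
# Two agents that both hold Jeremy's profile.
shared = [{"Kathryn", "Jeremy"}, {"Jeremy", "Sarah"}]

score_disjoint = sum_objective(disjoint)
score_shared = sum_objective(shared)
```

A lower value means less sharing among agents, which is what the distributor aims for.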
14. Distribution Strategies – Sample (1/4)
• Set T has four profiles:
– Kathryn, Jeremy, Sarah and Mark
• There are 4 agents:
– U1, U2, U3 and U4
• Each agent requests a sample of any 2 profiles
of T for a market survey
15. Distribution Strategies – Sample (2/4)
• Minimize $\sum_{i \neq j} |R_i \cap R_j|$
[Figure: distributions of profiles to agents U1, U2, U3, U4; label: Poor]
16. Distribution Strategies – Sample (3/4)
• Optimal Distribution
• Avoid full overlaps and minimize
$$\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$$
[Figure: distribution of profiles to agents U1, U2, U3, U4]
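For the running example (four profiles, four agents, two profiles each) the search space is small enough to enumerate. The sketch below is a brute-force illustration, not the distribution algorithm of the talk. It treats "full overlap" as two agents receiving identical sets, which is an interpretation, and the two example distributions are hypothetical.

```python
from itertools import combinations, product

T = ["Kathryn", "Jeremy", "Sarah", "Mark"]
# Every possible 2-profile sample an agent could receive (6 in total).
pairs = [frozenset(c) for c in combinations(T, 2)]

def sum_objective(sets):
    """Sum over i of (1/|R_i|) * sum over j != i of |R_i ∩ R_j|."""
    total = 0.0
    for i, r_i in enumerate(sets):
        overlap = sum(len(r_i & r_j) for j, r_j in enumerate(sets) if j != i)
        total += overlap / len(r_i)
    return total

def has_full_overlap(sets):
    """True if at least two agents received exactly the same set."""
    return len(set(sets)) < len(sets)

# Hypothetical distribution where agents pairwise receive identical samples.
twins = [frozenset({"Kathryn", "Jeremy"})] * 2 + [frozenset({"Sarah", "Mark"})] * 2
# Hypothetical distribution where each agent shares one profile with two others.
ring = [frozenset({"Kathryn", "Jeremy"}), frozenset({"Jeremy", "Sarah"}),
        frozenset({"Sarah", "Mark"}), frozenset({"Mark", "Kathryn"})]

# Enumerate all 6^4 = 1296 assignments of one pair to each of U1..U4.
best = min(sum_objective(a) for a in product(pairs, repeat=4))
best_no_twins = min(sum_objective(a) for a in product(pairs, repeat=4)
                    if not has_full_overlap(a))
```

Both example distributions reach the minimum value of the objective, yet in `twins` a leak of Kathryn's and Jeremy's profiles cannot distinguish between the two agents who hold that identical pair. This is why the objective alone is not enough and full overlaps have to be avoided as well.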
18. Distribution Strategies
Sample Data Requests
• The distributor has the
freedom to select the data
items to provide the agents
with
• General Idea:
– Provide agents with sets of data that are
as disjoint as possible
• Problem: There are cases
where the distributed data
must overlap, e.g. when
|R1|+…+|Rn|>|T|
Explicit Data Requests
• The distributor must
provide agents with the
data they request
• General Idea:
– Add fake data to the
distributed data to minimize
the overlap among agents
• Problem: Agents can collude
and identify fake data
• NOT COVERED in this talk
19. Conclusions
• Data Leakage
• Modeled as a maximum likelihood problem
• Data distribution strategies that help identify
the guilty agents