Cast

PRESERVING PRIVACY IN
SEMANTIC-RICH TRAJECTORIES
OF HUMAN MOBILITY
Anna Monreale, Roberto Trasarti, Dino
Pedreschi, Chiara Renso
KDDLab, Pisa
Vania Bogorny
Univ. Santa Catarina, Brazil
Knowledge Discovery and Delivery Lab
(ISTI-CNR & Univ. Pisa)
www-kdd.isti.cnr.it
ANONIMO MEETING, Pisa, September 20-21, 2010
SPRINGL 2010, San Jose, November 2, 2010

How the story begins…
Semantic trajectories represent the important places visited by people.
This information can be privacy sensitive! We should find a good generalization of the visited places… preserving semantics! But how?
Can we use a taxonomy of places to generalize and find anonymous datasets? Let’s ask Anna, Dino and Roberto for help!

Semantic Trajectories
• The availability of trajectory data is increasing
• From raw trajectories to new forms of trajectory data with richer semantic information: semantic trajectories
• Semantic trajectories represent the traces of moving objects as sequences of stops and moves
• A semantic trajectory can be represented as the sequence of its stops, e.g.
<Home, Work, ShoppingCenter, Gym>

Semantic Trajectory and Privacy
• Data owner should not reveal personal sensitive information
• Disclosure of personal sensitive information puts the citizen’s privacy at risk
• Hiding personal identifiers may not be sufficient
• Need for new privacy-preserving DT techniques
• Privacy by Design
• Natural trade-off between privacy quantification and data utility
• Analysis results should not be altered significantly
• Privacy has to be maximized

Semantic Trajectories Analysis and Privacy Issues
• Analyzing datasets of semantic trajectories may cause privacy issues
• A visited place may allow an attacker to infer sensitive personal information about an individual
• Example: from the fact that a person has stopped at an oncology clinic, an attacker can derive private information about that person's health.

Semantic Trajectories Analysis and Privacy Issues
k-anonymity is not enough for robust protection.
When individuals with similar trajectories stop in the same sensitive place, their sensitive information can easily be inferred.
Example:
#U1 <Park, Restaurant, Oncology Clinic>
#U2 <Park, Restaurant, Oncology Clinic>
This dataset is 2-anonymous, but the attacker can still infer that the user has been to the Oncology Clinic!
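The example above can be checked mechanically. The following Python sketch assumes, although the slide does not state the split explicitly, that Park and Restaurant act as quasi-identifier places and that Oncology Clinic is the only sensitive place. It groups users by quasi-identifier sequence, then reports the group size (the k of k-anonymity) and the number of distinct sensitive sequences in each group.

```python
from collections import defaultdict

# The two trajectories from the slide.
dataset = {
    "U1": ["Park", "Restaurant", "Oncology Clinic"],
    "U2": ["Park", "Restaurant", "Oncology Clinic"],
}
# Assumption: only the clinic is treated as a sensitive place.
SENSITIVE = {"Oncology Clinic"}

def group_by_quasi_identifier(trajectories, sensitive):
    """Map each quasi-identifier sequence to the sensitive stops seen with it."""
    groups = defaultdict(list)
    for stops in trajectories.values():
        qi = tuple(p for p in stops if p not in sensitive)
        groups[qi].append(tuple(p for p in stops if p in sensitive))
    return groups

groups = group_by_quasi_identifier(dataset, SENSITIVE)
# k is the size of the smallest group sharing a quasi-identifier sequence.
k = min(len(members) for members in groups.values())
# Number of different sensitive sequences inside each group.
distinct_sensitive = {qi: len(set(members)) for qi, members in groups.items()}
print(k, distinct_sensitive)
```

With k = 2 and a single distinct sensitive sequence in the group, knowing the quasi-identifier sequence is enough to learn the sensitive stop, which is the weakness the slide points out.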

The Privacy Framework
• Anonymizes datasets of semantic trajectories
• Based on semantic generalization and the notion of c-safety, which is similar to the notion of l-diversity for relational, tabular data
  • It relies on a taxonomy of places and on the notions of quasi-identifier places and sensitive places
• Preserves pattern mining results

Quasi-identifier and Sensitive stops
• The taxonomy of places
  • Represents important places and their semantic categories in a given domain
  • quasi-identifier places: can be used to infer the identity of the user
  • sensitive places: can disclose sensitive information about the user
• In general we do not have an a priori classification, since it depends on the application and the context
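As an illustration of these notions, here is a minimal Python sketch. The taxonomy, the place names and the choice of "Health Service" as the sensitive category are all hypothetical and chosen only for the example; as the slide notes, the real classification depends on the application. The helper names `ancestors`, `role_of` and `generalize` are also assumptions.

```python
# Hypothetical taxonomy, stored as child -> parent. "Place" is the root.
PARENT = {
    "Oncology Clinic": "Clinic",
    "Dental Clinic": "Clinic",
    "Clinic": "Health Service",
    "City Park": "Park",
    "Park": "Leisure",
    "Health Service": "Place",
    "Leisure": "Place",
}
# Assumption: everything below these categories is a sensitive place.
SENSITIVE_CATEGORIES = {"Health Service"}

def ancestors(place):
    """Return the chain from the place itself up to the root."""
    chain = [place]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def role_of(place):
    """Classify a place by looking for a sensitive category above it."""
    if any(node in SENSITIVE_CATEGORIES for node in ancestors(place)):
        return "sensitive"
    return "quasi-identifier"

def generalize(place, levels=1):
    """Climb up to `levels` steps in the taxonomy, stopping at the root."""
    chain = ancestors(place)
    return chain[min(levels, len(chain) - 1)]

print(role_of("Oncology Clinic"), role_of("City Park"))
print(generalize("Oncology Clinic"), generalize("Oncology Clinic", 5))
```

Attaching the sensitive label to a category rather than to individual places is a design choice of this sketch: it keeps the classification in one place when the application context changes.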

Privacy Model
• Adversary knowledge:
  • how we anonymize the data
  • the privacy place taxonomy describing the levels of abstraction
  • the fact that the user U is in the dataset
  • a quasi-identifier place sequence SQ visited by the user U
• Attack model:
  • Given SQ, the attacker builds a set of candidate semantic trajectories containing SQ and tries to infer the sensitive places visited by U.
  • We denote by Prob(SQ, S) the probability that, given a quasi-identifier place sequence SQ related to a user U, the attacker infers the sequence of sensitive places S visited by the user.

C-Safe Dataset
We want to control the probability Prob(SQ, S).
• A dataset ST is said to be c-safe with respect to the place set Q if, for every quasi-identifier place sequence SQ and for every sequence of sensitive places S, Prob(SQ, S) ≤ c, with c ∈ [0, 1].
• Given a sequence of sensitive places S = s1, …, sh and a quasi-identifier sequence SQ, the probability of inferring S is the conditional probability:
Prob(SQ, S) = P(S | SQ)
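One way to make the definition concrete is to estimate P(S | SQ) as a relative frequency: among the trajectories that contain SQ, count the fraction that also contain S. The slides do not spell out the estimator, so treat this Python sketch as an assumption. The dataset and the names `is_subsequence`, `inference_probability` and `is_c_safe` are hypothetical.

```python
from fractions import Fraction

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def inference_probability(dataset, sq, s):
    """Estimate P(S | SQ) as a relative frequency over the dataset."""
    candidates = [t for t in dataset if is_subsequence(sq, t)]
    if not candidates:
        return Fraction(0)
    hits = sum(1 for t in candidates if is_subsequence(s, t))
    return Fraction(hits, len(candidates))

def is_c_safe(dataset, qi_sequences, sensitive_sequences, c):
    """Check Prob(SQ, S) <= c for every listed pair (SQ, S)."""
    return all(
        inference_probability(dataset, sq, s) <= c
        for sq in qi_sequences
        for s in sensitive_sequences
    )

# Hypothetical dataset: three users share the same quasi-identifier stops.
dataset = [
    ["Park", "Restaurant", "Oncology Clinic"],
    ["Park", "Restaurant", "Oncology Clinic"],
    ["Park", "Restaurant", "Dental Clinic"],
]
sq = ["Park", "Restaurant"]
sensitive = [["Oncology Clinic"], ["Dental Clinic"]]
p = inference_probability(dataset, sq, ["Oncology Clinic"])
print(p, is_c_safe(dataset, [sq], sensitive, Fraction(1, 2)))
```

Exact fractions are used instead of floats so that comparisons against the threshold c are not affected by rounding.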

How can we obtain a c-safe dataset?
The CAST (C-safe Anonymization of Semantic Trajectories) algorithm guarantees that P(S|SQ) ≤ c for every pair of sequences S and SQ.
In the pseudocode below, S denotes the set of sequences still to be processed.
While (|S| > 0)
    SL = { s ∈ S | length(s) = MaxLength(S) }
    While (|SL| >= m)
        1. Compute the cost of every possible group Gi of m sequences in SL as: CostGi = CostQGi + CostSGi.
        2. Apply the generalization with the lowest cost, storing the result in R.
        3. Remove Gi from S and SL.
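The grouping loop can be sketched in Python as follows. This is not the authors' implementation. The cost function is supplied by the caller (here a toy stand-in that counts differing places per position), the generalization step itself is omitted, and sequences that cannot fill a group of size m are simply set aside. That last choice is an assumption, made so that the loop terminates, because the slide does not say what happens to them.

```python
from itertools import combinations

def cast_grouping(sequences, m, group_cost):
    """Greedily pick the cheapest group of m sequences among the longest ones."""
    remaining = list(sequences)
    groups, leftovers = [], []
    while remaining:
        max_len = max(len(s) for s in remaining)
        longest = [s for s in remaining if len(s) == max_len]
        remaining = [s for s in remaining if len(s) != max_len]
        while len(longest) >= m:
            # Step 1: evaluate every possible group of m sequences.
            best = min(
                combinations(range(len(longest)), m),
                key=lambda idx: group_cost([longest[i] for i in idx]),
            )
            # Step 2 (generalizing the group) is omitted; we only record it.
            groups.append([longest[i] for i in best])
            # Step 3: remove the chosen group from the working set.
            longest = [s for i, s in enumerate(longest) if i not in best]
        leftovers.extend(longest)  # assumption: too few sequences to group
    return groups, leftovers

def toy_cost(group):
    """Hypothetical cost: extra distinct places per position."""
    return sum(len(set(column)) - 1 for column in zip(*group))

sequences = [("A", "B"), ("A", "B"), ("A", "C"), ("X", "Y"), ("D",)]
groups, leftovers = cast_grouping(sequences, 2, toy_cost)
print(groups, leftovers)
```

Enumerating all combinations is exponential in m, which is consistent with the conclusions listing better heuristics as future work.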

Example (1): The process
Consider the following set of sequences, with m = 3 and c = 0.45:
S = { <S1, R2, H1, R1, C1, S2>,
      <S3, D1, R1, C1, S2>,
      <S1, P3, C2, D2, S2>,
      … }

Example (2): CostQ
CostQ is the number of hops in the taxonomy tree needed to generalize the quasi-identifier sequences to a common one.
Consider the group:
<S1, R2, H1, R1, C1, S2>
<S3, D1, R1, C1, S2>
<S1, P3, C2, D2, S2>
CostQ = 6 + 6 + 6 = 18
<Station,Place,Entertainment,S2 (H1,C1)>
<Station,Place,Entertainment,S2 (C1)>
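The hop count can be illustrated with a short Python sketch. The slides do not reproduce the taxonomy, so the tree below is hypothetical: its intermediate categories (Restaurant, Disco, Park, Outdoor) are invented, and it was chosen only so that the result agrees with the generalized sequence and the 6 + 6 + 6 hops shown above. The sketch also assumes that H1, C1 and C2 are the sensitive places, so only the remaining stops enter CostQ, and that sequences are generalized position by position to their lowest common ancestor.

```python
# Hypothetical taxonomy (child -> parent); not the one used by the authors.
PARENT = {
    "S1": "Station", "S2": "Station", "S3": "Station",
    "R1": "Restaurant", "R2": "Restaurant",
    "D1": "Disco", "D2": "Disco",
    "P3": "Park",
    "Restaurant": "Entertainment", "Disco": "Entertainment",
    "Park": "Outdoor",
    "Station": "Place", "Entertainment": "Place", "Outdoor": "Place",
}

def ancestors(place, parent):
    """Chain from the place itself up to the root."""
    chain = [place]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def common_ancestor(places, parent):
    """Lowest node that is an ancestor (or self) of every given place."""
    chains = [ancestors(p, parent) for p in places]
    for node in chains[0]:
        if all(node in chain for chain in chains[1:]):
            return node
    raise ValueError("places do not share a root")

def cost_q(group, parent):
    """Return the common generalized sequence and the hops per sequence."""
    targets = [common_ancestor(column, parent) for column in zip(*group)]
    per_sequence = [
        sum(ancestors(p, parent).index(t) for p, t in zip(seq, targets))
        for seq in group
    ]
    return targets, per_sequence

# Quasi-identifier stops of the three sequences (sensitive stops removed).
group = [
    ["S1", "R2", "R1", "S2"],
    ["S3", "D1", "R1", "S2"],
    ["S1", "P3", "D2", "S2"],
]
targets, per_sequence = cost_q(group, PARENT)
print(targets, per_sequence, sum(per_sequence))
```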

Example (3): CostS
CostS is the number of hops in the taxonomy tree needed to generalize the sensitive places so that c-safety is obtained.
From the generalized group:
<Station, Place, H1, Entertainment, Clinic, S2>
<Station, Place, Entertainment, Clinic, S2>
<Station, Place, Clinic, Entertainment, S2>
CostS = 3
The total cost of this group is 18 + 3 = 21 hops, which is the lowest-cost combination.

Example (4): Why it is c-safe
SQ = <Station, Place, Entertainment, S2>
Probability of a successful inference: Prob(SQ, H1) = 1/3 < c, Prob(SQ, C1) = 2/3 > c and Prob(SQ, C2) = 1/3 < c.
We need to generalize C1 to the next higher level in the taxonomy: Clinic.
The probability of inferring C1 becomes 2/5 < c!
C-safe dataset:
<Station, Place, H1, Entertainment, Clinic, S2>
<Station, Place, Entertainment, Clinic, S2>
<Station, Place, Clinic, Entertainment, S2>
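The three probabilities before generalization can be reproduced with a few lines of Python, assuming they are relative frequencies within the group of three trajectories that share SQ. The sketch stops there: it does not derive the 2/5 value obtained after generalizing to Clinic, because the slides do not give the formula or the taxonomy needed for it.

```python
from collections import Counter
from fractions import Fraction

c = Fraction(45, 100)  # threshold from the example, c = 0.45

# Sensitive places of the three trajectories that share the same SQ.
group = [["H1", "C1"], ["C1"], ["C2"]]

# Count in how many trajectories of the group each sensitive place appears.
counts = Counter(place for stops in group for place in set(stops))
probabilities = {place: Fraction(n, len(group)) for place, n in counts.items()}

# Places whose inference probability exceeds c must be generalized.
violations = sorted(p for p, prob in probabilities.items() if prob > c)
print(probabilities, violations)
```

Only C1 exceeds the threshold, which is why the example generalizes it.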

Experiments
The dataset contains the trajectories of 17000 moving cars in Milan over one week, collected through GPS devices.
We found 6225 semantic trajectories with an average length of 5.2 stops.
We ran a sequential pattern mining algorithm and measured the quality of the results with two measures:
• the coverage coefficient
• the distance coefficient

Experiments: Quality of the analysis
The coverage coefficient measures how many of the patterns extracted from the original dataset are covered by the patterns extracted from the anonymized dataset, that is, have a superclass in the taxonomy among them.
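A possible reading of this measure, sketched in Python: an original pattern counts as covered when some anonymized pattern of the same length has, at every position, either the same place or one of its ancestors in the taxonomy. The position-wise matching, the normalization to a fraction, the toy taxonomy and all function names are assumptions, since the slide gives only the informal definition.

```python
def is_ancestor_or_self(general, specific, parent):
    """True if `general` equals `specific` or lies above it in the taxonomy."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = parent.get(node)
    return False

def covers(anonymized, original, parent):
    """Position-wise check that the anonymized pattern generalizes the original."""
    return len(anonymized) == len(original) and all(
        is_ancestor_or_self(a, o, parent) for a, o in zip(anonymized, original)
    )

def coverage_coefficient(original_patterns, anonymized_patterns, parent):
    """Fraction of original patterns covered by at least one anonymized pattern."""
    covered = sum(
        any(covers(a, o, parent) for a in anonymized_patterns)
        for o in original_patterns
    )
    return covered / len(original_patterns)

# Hypothetical taxonomy (child -> parent) and mined patterns.
parent = {"C1": "Clinic", "C2": "Clinic", "R1": "Restaurant"}
original = [("R1", "C1"), ("R1", "C2"), ("C1", "R1")]
anonymized = [("R1", "Clinic")]
print(coverage_coefficient(original, anonymized, parent))
```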

Experiments: Coverage Coefficient

Experiments: Quality of the analysis
The distance coefficient measures the distance, in terms of steps in the taxonomy, between the patterns extracted from the original dataset and those extracted from the anonymized dataset.
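As a rough illustration, the Python sketch below counts, for each original pattern, the taxonomy steps needed to reach the closest anonymized pattern that generalizes it, and averages over the patterns that have such a match. The averaging, the handling of unmatched patterns, the toy taxonomy and the function names are all assumptions, since the slide defines the measure only informally.

```python
def hops_up(specific, general, parent):
    """Steps from `specific` up to `general`, or None if not an ancestor."""
    node, steps = specific, 0
    while node is not None:
        if node == general:
            return steps
        node, steps = parent.get(node), steps + 1
    return None

def pattern_distance(original, anonymized, parent):
    """Total hops to turn `original` into `anonymized`, or None if impossible."""
    if len(original) != len(anonymized):
        return None
    total = 0
    for o, a in zip(original, anonymized):
        hops = hops_up(o, a, parent)
        if hops is None:
            return None
        total += hops
    return total

def distance_coefficient(original_patterns, anonymized_patterns, parent):
    """Mean distance to the closest generalizing pattern (matched patterns only)."""
    best = []
    for o in original_patterns:
        distances = [pattern_distance(o, a, parent) for a in anonymized_patterns]
        distances = [d for d in distances if d is not None]
        if distances:
            best.append(min(distances))
    return sum(best) / len(best) if best else None

# Hypothetical taxonomy (child -> parent) and mined patterns.
parent = {"C1": "Clinic", "C2": "Clinic", "R1": "Restaurant"}
original = [("R1", "C1"), ("R1", "C2"), ("C1", "R1")]
anonymized = [("R1", "Clinic"), ("Restaurant", "Clinic")]
print(distance_coefficient(original, anonymized, parent))
```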

Experiments: Distance
Coefficient

Conclusions and Future work
• Improve the algorithm with better heuristics and with support for groups that are not of a fixed size
• More experiments with other mining algorithms
• More utility measures for the evaluation of results
• Another future research direction is the exploitation of c-safe semantic trajectory datasets for the semantic tagging of trajectories. How does the anonymization step
