Mine Me but Don’t Single Me out:
Differentially Private Event Logs for Process Mining
Gamal Elkoumy, Alisa Pankova, Marlon Dumas
University of Tartu and Cybernetica, Estonia
ICPM 2021
Event Log
Privacy Threats
John Doe checked in at 8:33 AM and had surgery from 8:35 AM to 9:25 AM.
On average, the activity "surgery" happens between 8 AM and 10 AM, with an execution time between 30 minutes and 2 hours.
GDPR
Motivation
• Masking (pseudonymization) is not enough.
• An attacker can use prefixes or suffixes of John’s trace to identify him.
• The attacker can also use the event timestamps to identify John.
Attack Model
The attacker has a goal h(L) that captures their interest in an event log L. We specifically consider the following attacker goals:
• h1: Has the individual been through a specific sub-trace (prefix or suffix)? The output is a bit with a value ∈ {0, 1} that represents yes or no.
• h2: What is the execution time of a particular activity that has been executed for the individual? The output is a real value to be guessed with a given precision.
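Restated in LaTeX notation, using the running example from the "Privacy Threats" slide (this is only a paraphrase of the two goals above, not extra formalism from the paper):

\[
h_1(L) \in \{0,1\}, \quad \text{e.g. ``did John's trace pass through } \langle\text{check-in}, \text{surgery}\rangle\text{?''}
\]
\[
h_2(L) \in \mathbb{R}_{\ge 0}, \quad \text{e.g. ``how long did John's surgery take?'' (50 minutes in the example)}
\]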
Motivation
• We do not want the attacker to be able to identify John Doe after releasing the event log.
• More precisely, we do not want the attacker's guessing probability to increase by more than a certain amount after the event log release.
• We use this bound on the guessing advantage (δ) to anonymize event logs.
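This can be made precise with the usual notion of guessing advantage (a generic formulation; the paper's exact definition may differ in detail): the increase in the attacker's success probability for a goal h after seeing the released log L' must stay below δ,

\[
\Pr[\text{attacker guesses } h(L) \mid L'] \;-\; \Pr[\text{attacker guesses } h(L)] \;\le\; \delta .
\]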
Problem Statement
• Given an event log L and a maximum acceptable guessing advantage δ, generate an anonymized event log L' such that the probability of singling out an individual after publishing L' does not increase by more than δ.
Differential Privacy (ε)
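The definition shown on this slide did not survive extraction; for reference, the standard definition of ε-differential privacy for a randomized mechanism M is

\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S]
\]

for every pair of neighbouring event logs D, D′ (differing in one individual's data) and every set of outputs S. A smaller ε means stronger privacy.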
Approach
Approach – DAFSA
• Group events that go through the same prefixes/suffixes.
• Minimal Grouping shared prefixes/suffixes.
• Lossless representation of event log.
• Oversampling size estimation.
• Timestamp anonymization
Pipeline: Annotate event log with DAFSA states → ε estimation for relative time → ε estimation for trace variants → Oversample cases → Inject noise into timestamps.
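To make the prefix/suffix grouping concrete, here is a minimal Python sketch (this is not the paper's DAFSA construction; the function and the toy log are illustrative only):

from collections import Counter

def prefix_suffix_groups(variants):
    """Count how many cases share each prefix and each suffix.
    `variants` maps a trace variant (a tuple of activity labels) to the
    number of cases that follow it. A prefix or suffix shared by a single
    case is exactly what lets an attacker single that case out."""
    prefixes, suffixes = Counter(), Counter()
    for trace, n_cases in variants.items():
        for i in range(1, len(trace) + 1):
            prefixes[trace[:i]] += n_cases
            suffixes[trace[-i:]] += n_cases
    return prefixes, suffixes

# Toy log: 3 cases follow <A, B, C> and 1 case follows <A, B, D>.
variants = {("A", "B", "C"): 3, ("A", "B", "D"): 1}
prefixes, suffixes = prefix_suffix_groups(variants)
print([p for p, c in prefixes.items() if c == 1])  # uniquely identifying prefixes
print([s for s, c in suffixes.items() if c == 1])  # uniquely identifying suffixes

In the toy log, the prefix ⟨A, B⟩ is shared by all four cases, but the suffix ⟨D⟩ is followed by only one case, so an attacker who knows the victim's last activity can single them out. These shared groups are what the anonymization has to protect.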
Approach – Event Log Annotation
Approach – ε estimation
• The effect of ε differs based on the range of values.
• Two ε values are used:
  • ε for trace variant anonymization.
  • ε for timestamp anonymization.
• We estimate a different ε for each event based on the distribution of values.
ε estimation – Personalized Differential Privacy
Input: δ=0.2
ε estimation – Personalized Differential Privacy
ε = 0.81 for δ = 0.2
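As an illustration of how a guessing-advantage bound δ can be mapped to an ε, here is a generic back-of-the-envelope calculation in Python for a binary goal under a uniform prior (it is not the paper's personalized, distribution-dependent estimator, which is why it yields ≈ 0.85 rather than the 0.81 shown on the slide):

import math

def epsilon_for_advantage(delta, prior=0.5):
    """Largest ε such that an ε-DP release cannot raise an attacker's success
    probability for a binary goal from `prior` to more than `prior + delta`.
    Uses the posterior-odds bound: posterior odds <= e^ε * prior odds."""
    posterior = prior + delta
    assert 0 < prior < 1 and posterior < 1
    return math.log((posterior * (1 - prior)) / (prior * (1 - posterior)))

print(round(epsilon_for_advantage(0.2), 2))  # 0.85 for δ = 0.2 and prior 0.5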
Approach – Oversampling
• The set of case variants is the main input used by process mining techniques, e.g., conformance checking.
• In this setting, preserving the set of case variants is critical:
  • Adding new case variants increases the false positives.
  • Removing existing case variants increases the false negatives.
• We therefore adopt oversampling over the DAFSA transitions to prevent singling out an individual via their prefix/suffix (a sketch follows this list).
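A minimal Python sketch of the oversampling idea (assumptions: noise is drawn per trace variant rather than per DAFSA transition, from a one-sided geometric distribution; the real mechanism calibrates the noise from the per-event ε values, so the function name and details here are illustrative):

import math, random

def oversample_counts(variant_counts, epsilon, rng=random.Random(0)):
    """Return new, equal-or-larger counts per existing trace variant.
    Each variant gets a non-negative number of duplicated cases drawn from a
    one-sided geometric distribution with Pr[extra = k] proportional to
    exp(-ε·k). No variant is removed and no new variant is created, so
    downstream analyses see neither false positives nor false negatives."""
    p = 1 - math.exp(-epsilon)        # success probability of the geometric draw
    noisy = {}
    for variant, count in variant_counts.items():
        extra = 0
        while rng.random() > p:       # count failures before the first success
            extra += 1
        noisy[variant] = count + extra
    return noisy

variants = {("A", "B", "C"): 3, ("A", "B", "D"): 1}
print(oversample_counts(variants, epsilon=0.81))

Oversampling only duplicates existing cases, which is why the slide stresses that adding or removing variants would distort conformance checking.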
Approach – Oversampling
Approach – Time Noise Injection
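The content of this slide is a figure; as a rough Python sketch of the step (assumed details: relative times in minutes, a single Laplace noise draw per event with an assumed sensitivity, clipping at zero — the actual mechanism uses the per-event ε values estimated earlier):

import random

def anonymize_relative_times(rel_times_min, epsilon, sensitivity=1.0,
                             rng=random.Random(42)):
    """Add Laplace noise with scale = sensitivity / ε to each relative time
    (e.g., minutes since the previous event in the case) and clip at zero so
    durations stay non-negative. Illustrative only: the sensitivity and the
    single shared ε are assumptions."""
    scale = sensitivity / epsilon
    noisy = []
    for t in rel_times_min:
        # Laplace(0, scale) sampled as the difference of two exponentials.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        noisy.append(max(0.0, t + noise))
    return noisy

# The surgery from the running example lasted 50 minutes (8:35–9:25).
print(anonymize_relative_times([50.0], epsilon=0.81))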
Anonymized Log
Empirical Evaluation
We selected 14 publicly available event logs.
What is the effect of choosing a privacy level δ on the time dilation of the anonymized event log?
What is the effect of choosing a privacy level δ on the case variant distribution of the anonymized event log?
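Two toy metrics in Python, only to make these two questions concrete (the evaluation's actual metrics are not given in the extracted text, so both function definitions below are assumptions):

def time_dilation(original_cycle_times, anonymized_cycle_times):
    """Ratio of the mean case cycle time after anonymization to the mean
    before it; 1.10 means cases got 10% longer on average."""
    return (sum(anonymized_cycle_times) / len(anonymized_cycle_times)) / \
           (sum(original_cycle_times) / len(original_cycle_times))

def variant_distribution_shift(original_counts, anonymized_counts):
    """Total-variation distance between the relative-frequency distributions
    over trace variants before and after anonymization (0 = identical)."""
    n1, n2 = sum(original_counts.values()), sum(anonymized_counts.values())
    all_variants = set(original_counts) | set(anonymized_counts)
    return 0.5 * sum(abs(original_counts.get(v, 0) / n1
                         - anonymized_counts.get(v, 0) / n2)
                     for v in all_variants)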
Conclusion and Future Work
• We proposed the concept of a differentially private event log and a mechanism to compute such logs.
• A differentially private event log limits the increase in the probability that an attacker can single out an individual based on the prefixes or suffixes of the traces in the log and on the timestamps of each event.
A limitation is that the proposed method introduces high levels of noise in the presence of unique traces or temporal outliers. We plan to investigate an approach where high-risk traces are suppressed, so that less noise needs to be injected into the remaining traces.
A second future research avenue is to anonymize other columns of the event log, e.g., resources.
Questions
