Libra: High-Utility Anonymization of Event Logs for Process Mining via Subsampling
Gamal Elkoumy and Marlon Dumas
University of Tartu
ICPM 22
Event Log
ID | Name           | Activity    | Timestamp        | Age | Sex | Zip   | Disease
1  | Marco Montali  | Register    | 07.01.2020-08:30 | 37  | M   | 13053 | Flu
1  | Marco Montali  | Visit       | 07.01.2020-08:45 | 37  | M   | 13053 | Flu
2  | Fabrizio Maggi | Register    | 07.01.2020-08:46 | 35  | M   | 51009 | Infection
1  | Marco Montali  | Blood Test  | 07.01.2020-08:57 | 37  | M   | 13053 | Flu
1  | Marco Montali  | Discharge   | 07.01.2020-08:58 | 37  | M   | 13053 | Flu
2  | Fabrizio Maggi | Hospitalize | 07.01.2020-09:01 | 35  | M   | 51009 | Infection
2  | Fabrizio Maggi | Blood Test  | 07.01.2020-10:30 | 35  | M   | 51009 | Infection
2  | Fabrizio Maggi | Visit       | 07.02.2020-09:35 | 35  | M   | 51009 | Infection
2  | Fabrizio Maggi | Discharge   | 07.02.2020-14:00 | 35  | M   | 51009 | Infection
Motivation - GDPR
Singling Out
Marco Montali checked in at 8:30 AM, and his blood test took 1 minute.
10 patients took Blood Tests
on that day, and on average a
test takes 2 minutes.
Releasing The Log
Attack Model
The attacker has a goal h(L) to infer information from the log. We consider the following goals:
• h1: Determine whether an individual is in the log through their execution flow.
• h2: Determine the execution time of an activity.
Differential Privacy
• Mitigates linkage attacks across past, present, and future releases of datasets.
• Provides quantification for
privacy loss.
• Supports composing multiple simple mechanisms into a larger one.
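To illustrate the quantification of privacy loss, the classic Laplace mechanism achieves ε-DP for a numeric query by adding noise calibrated to the query's sensitivity. This is a standard textbook sketch, not code from the paper; the function name is illustrative:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value plus Laplace(sensitivity / epsilon) noise.

    Calibrating the noise scale to sensitivity / epsilon makes the
    released value epsilon-differentially private.
    """
    scale = sensitivity / epsilon
    # Inverse-transform sampling of the Laplace distribution.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise
```

A smaller ε means a larger noise scale, i.e., stronger privacy at the cost of utility.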
Research Question
• Given an event log L, wherein each trace contains
private information about an individual (e.g., a
customer),
• and given a privacy budget ε,
• generate an anonymized event log L’ that provides
an ε-differential privacy guarantee to each
individual represented in the log.
Proposal – Privacy Amplification
• Differential privacy guarantees can be
amplified by applying DP to a small
random subsample of records.
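For intuition, here is the standard amplification bound (a known result, not a formula stated on the slide): an ε-DP mechanism applied to a Poisson subsample with sampling rate q satisfies a tighter ln(1 + q(e^ε − 1))-DP guarantee.

```python
import math

def amplified_epsilon(epsilon, q):
    """Privacy amplification by subsampling: running an epsilon-DP
    mechanism on a Poisson subsample with rate q yields an overall
    guarantee of log(1 + q * (exp(epsilon) - 1)) <= epsilon."""
    return math.log(1.0 + q * (math.exp(epsilon) - 1.0))
```

For example, with ε = 1 and q = 0.1 the amplified budget is roughly 0.159, so the same noise buys a much stronger guarantee.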
Proposal
• We use privacy amplification to
achieve a lower utility loss relative
to classic DP-anonymization
techniques for a given level of
privacy.
• We use the composition property
of differential privacy to compose
the separately anonymized
subsamples to establish the final
anonymized log.
Approach
• We filter out rare trace variants.
• Rare trace variants are variants executed by only a few individuals.
• Observing such traces may increase the attacker’s confidence about this group of individuals.
• We remove trace variants that occur fewer than C times.
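The filtering step can be sketched as follows (illustrative code, not from the paper); a trace variant is the sequence of activity names of a trace:

```python
from collections import Counter

def filter_rare_variants(traces, c):
    """Drop traces whose variant (activity sequence) occurs fewer
    than c times in the log."""
    counts = Counter(tuple(trace) for trace in traces)
    return [trace for trace in traces if counts[tuple(trace)] >= c]
```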
• We perform Poisson subsampling to
achieve privacy amplification.
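Poisson subsampling includes each trace independently with a fixed probability q; a minimal sketch (illustrative names):

```python
import random

def poisson_subsample(traces, q, rng=random):
    """Include each trace independently with probability q.
    Unlike fixed-size sampling, the subsample size is random,
    which matches the assumptions of the amplification analysis."""
    return [trace for trace in traces if rng.random() < q]
```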
• We can use any of the DP
approaches in the literature.
• We use the approach presented
at ICPM21.
• “Mine Me but Don’t Single Me
Out: Differentially Private Event
Logs for Process Mining”
• We anonymize the case variants
of the log by means of
over/under-sampling.
• We anonymize the start time of a case by displacing it left or right according to a Laplace distribution.
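The over/under-sampling of case variants can be sketched as below (illustrative; the noisy target count would come from a DP mechanism such as the Laplace mechanism):

```python
import random

def resample_variant(variant_traces, noisy_count, rng=random):
    """Over- or under-sample one variant's traces so that the
    released log contains the (differentially private) noisy
    target count for that variant."""
    target = max(0, round(noisy_count))
    if target <= len(variant_traces):
        # Under-sample: keep a random subset of the traces.
        return rng.sample(variant_traces, target)
    # Over-sample: duplicate randomly chosen traces.
    extra = [rng.choice(variant_traces)
             for _ in range(target - len(variant_traces))]
    return variant_traces + extra
```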
• We select the statistically significant traces out of the anonymized subsamples to increase the utility of the DP event log.
• We adopt the statistically significant sampling presented by Bauer et al. (2018).
• Note: DP guarantees are still preserved, since the selection is differentially-private post-processing.
• We combine the anonymized
subsamples to construct the
anonymized log.
• Note: DP quantifies the privacy guarantees that hold after composition.
• We use Rényi DP to estimate the privacy guarantees of the composition.
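Under Rényi DP, guarantees at a fixed order α simply add up across mechanisms, and the total can then be converted to an (ε, δ)-DP guarantee. A minimal sketch using the standard conversion (the paper's exact accounting may differ):

```python
import math

def compose_rdp(rdp_epsilons, alpha, delta):
    """Compose mechanisms under Renyi DP of order alpha: their RDP
    epsilons add, and the sum converts to (epsilon, delta)-DP via
    epsilon = total_rdp + log(1 / delta) / (alpha - 1)."""
    total_rdp = sum(rdp_epsilons)
    return total_rdp + math.log(1.0 / delta) / (alpha - 1.0)
```

In practice one evaluates this over a range of orders α and reports the smallest resulting ε.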
Empirical Evaluation
• We measure the distance between the anonymized log and the original log.
• The distance is quantified as the Earth mover’s distance between the directly-follows graphs (DFGs) of the anonymized log and the original log.
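A simplified sketch of this metric (illustrative; with a 0/1 ground distance between edges, the Earth mover's distance over normalized DFG edge frequencies reduces to the total variation distance, whereas the paper's actual ground distance may be richer):

```python
from collections import Counter

def dfg_edges(traces):
    """Directly-follows edge frequencies of an event log."""
    edges = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges

def dfg_distance(log_a, log_b):
    """EMD between normalized DFG edge frequencies under a 0/1
    ground distance (i.e., total variation distance in [0, 1])."""
    ea, eb = dfg_edges(log_a), dfg_edges(log_b)
    total_a, total_b = sum(ea.values()), sum(eb.values())
    keys = set(ea) | set(eb)
    return 0.5 * sum(abs(ea[k] / total_a - eb[k] / total_b)
                     for k in keys)
```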
• We evaluate the approach using 8 real-world event logs.
• We compare the approach to the state-of-the-art.
• All the anonymized logs are publicly available at
https://doi.org/10.5281/zenodo.6376761.
Summary
• In this paper, we used several properties of differential privacy to enable high-utility anonymization.
• We used privacy amplification to provide the same privacy guarantees while injecting less noise.
• We used differentially-private post-processing to select the statistically significant traces, which increased utility.
• We used composition to combine the anonymized subsamples.
Thank you for attending!
Questions?

Editor's Notes

  • #25 The empirical evaluation shows that the privacy amplification effect leads to significant reductions in utility loss, particularly when anonymizing the frequency distribution of case variants.