Limitations of Privacy Solutions for Log Files
We have considered applying a range of privacy solutions to log files.
We found that methods such as differential privacy and k-anonymity are not suitable for log files.
We make a proposal that replaces personal identifiers with ring signatures when collecting log files.
In particular, we offer a lightweight ring signature proposal which significantly improves privacy
when collecting log files while still allowing those log files to be processed for tasks such as identifying IoCs.
Jonathan Oliver (jon_oliver@trendmicro.com)
31 August 2021
1 Introduction
In this paper we are considering collecting log files (in particular log files for security purposes)
and the storage / processing of those log files. Some use cases include:
• Working with data which has PII (personally identifiable information) embedded in it.
For example, data with email addresses in it.
• When data is processed in a 3rd party country. For example, data which is collected in
country A may be hosted on cloud servers in country B. Complex situations may arise
because the data may fall under the laws of country B.
• Extracting IoCs (indicators of compromise) from data. We are interested in IoCs which
are public knowledge and do not uniquely identify a victim.
1.0.1 Privacy Example
Consider a situation with 3 people: Alice, Bob and Charlie. Each person generates log files
which track various events which occur on their computers.
Attackers send personalized malware with the string XYZZY (the malicious IoC) and the
name of the victim encoded. So the logs look like
Person Computer Event Data
------ -------- ----- ----
Alice Computer1 EventA-1 XYZZY-abc
Alice Computer1 EventA-2 XYZZY-abc
Alice Computer1 EventA-3 XYZZY-abc
...
Bob Computer2 EventB-1 XYZZY-def
Bob Computer2 EventB-2 XYZZY-def
Bob Computer2 EventB-3 XYZZY-def
...
Charlie Computer3 EventC-1 XYZZY-ghj
Charlie Computer3 EventC-2 XYZZY-ghj
Charlie Computer3 EventC-3 XYZZY-ghj
where
abc = encrypted(Alice)
def = encrypted(Bob)
ghj = encrypted(Charlie)
We want to extract an IoC associated with this malware (XYZZY in this case) while maximising
the privacy afforded to Alice / Bob / Charlie.
This example is typical of various log files which are generated by security products such as:
• Email logs;
• Windows event logs;
• Firewall logs;
• . . .
1.1 Desirable Properties
We desire a privacy solution which allows us to collect the logs from various machines / computers
and process them in a way that protects the privacy of the individuals. Specifically, we want to do
this in a way which meets our privacy requirements:
• Collect these logs from multiple computers into a single repository;
• Transform / delete parts of the data which identify a person;
• Retain data which occurs across multiple people (and hence may be considered public data);
all within a reasonable amount of computation.
1.2 Review of Privacy Approaches
Here we give a review of the various privacy methods and attempt to apply them to our example
above. Here we distinguish between 2 types of data:
• Descriptive data: which has one row per person (the majority of privacy methods adequately
address this problem)
• Log files: where a person may contribute multiple rows (typically many rows). This covers
the various log files mentioned above (event logs, firewall logs, etc.) and we discuss below
why privacy solutions (such as differential privacy or k-anonymity) do not adequately
address these types of data.
1.2.1 Descriptive Data
A typical list of people might look like:
Id  Name      Email          Country    Industry
1   Person A  a@abc.company  Argentina  Accounting
2   Person B  b@b.company    Brazil     Manufacturing
. . .
100 Person Z  z@z.company    USA        Health
This type of data can be made “private” using differential privacy or k-anonymity (well respected
privacy approaches used around the world).
1.2.2 Log Files
Log files consist of 2 separate tables (explicitly or implicitly). Most log files take the form where
the first table defines the people under consideration, and the second table defines events or
transactions for each person in the first table.
The first table is a list of people:
Table 1
PID Col1 . . . ColMax1
P1 . . .
. . . . . .
PMax . . .
Column 1 is the PID, which uniquely identifies each person.
The second table is a list of events (or transactions) from the people in Table 1:
Table 2
PID Event Id Col1 . . . ColMax2
P1 Event1 . . .
P1 Event2 . . .
P1 Event3 . . .
. . . . . . . . .
Pj EventMax . . .
In the second table, we allow multiple events to be associated with a personal identifier. For example,
Table 2 above has 3 events associated with PID P1.
1.2.3 Privacy Approaches
We review a range of privacy mechanisms in this paper, and consider how they can be applied
to the log file problem. We consider:
• Differential Privacy [1, 2]
• k-anonymity [3, 4]
• Homomorphic Encryption [5, 6]
• Monero style privacy [7]
• Secure Multiparty Computation [8, 9] (which also covers Federated Machine Learning [10])
• Secret Sharing Schemes [11]
1.2.4 Privacy Operations
The operations used by privacy mechanisms (including those listed above) include:
• Suppressing data (either deleting it or replacing it with NULL values);
• Generalizing data (example transforming a persons age into an age range);
• Encrypting data;
• Hashing data; and
• Adding errors to data.
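As a concrete illustration, the suppression, generalization, and hashing operations above can be sketched in a few lines of Python (the record fields, the 10-year age buckets, and the salt are illustrative assumptions, not part of any particular privacy framework):

```python
import hashlib

def suppress(record, field):
    """Suppress data: replace an identifying field with a NULL value."""
    return {**record, field: None}

def generalize_age(age):
    """Generalize data: transform an exact age into a 10-year age range."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

def pseudonymize(value, salt="example-salt"):
    """Hash data: replace a value with a truncated salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

record = {"name": "Person A", "email": "a@abc.company", "age": 34}
private = suppress(record, "email")              # email -> None
private["age"] = generalize_age(record["age"])   # 34 -> "30-39"
private["name"] = pseudonymize(record["name"])   # stable pseudonym
```

Adding errors to data is the basis of differential privacy, discussed in the next section.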
2 Differential Privacy
Differential privacy is a system for publicly sharing information about a dataset by describing
the patterns of groups within the dataset while withholding information about individuals in
the dataset.
Consider the situation where we have a data row of interest. If errors are added in a
systematic way so that we get similar or the same answers with / without the row in question,
then we have protected the privacy of that row.
The definition and maths can extend to making 2 rows, 3 rows, . . . private. This covers the
case where we may want to allow groups of individuals up to some size N to remain private. So,
given N, the maximum number of rows that we need to make private at once, we can determine
the error distribution required to achieve that.
Differential Privacy is not suited to the log file problem. The amount of error required to
achieve privacy on a log file depends on the number of rows which may be associated with
a person. An email log file for one day might contain 100 emails from a single user. To ensure the
privacy of this data would require an extraordinary amount of error to be added, which would
almost certainly make any analysis useless.
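To see why, consider the Laplace mechanism, the standard way of adding error in differential privacy: the noise scale is sensitivity/ε, and for a counting query the sensitivity is the maximum number of rows one person can contribute. A minimal sketch (ε = 1.0, the counts, and the fixed seed are illustrative assumptions):

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, max_rows_per_person, epsilon=1.0, rng=None):
    """Counting query under the Laplace mechanism.

    The sensitivity is the maximum number of rows one person may
    contribute, so the noise scale is max_rows_per_person / epsilon.
    """
    rng = rng or random.Random(0)
    scale = max_rows_per_person / epsilon
    return true_count + laplace_noise(scale, rng)

# Descriptive data: one row per person, so scale = 1 and answers stay usable.
# Log files: ~100 rows per person, so scale = 100 and the same noise draw
# perturbs the answer 100 times as much.
print(noisy_count(1000, 1))
print(noisy_count(1000, 100))
```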
3 K-Anonymity
k-anonymity is a property possessed by certain anonymized data. A release of data is said to
have the k-anonymity property if the information for each person contained in the release cannot
be distinguished from at least k − 1 individuals whose information also appears in the release.
At first glance, k-anonymity does appear to be relevant to the log file problem.
3.1 Limitations of k-Anonymity
k-anonymity suffers from the following limitations:
• Background knowledge may be available that is not in the dataset which allows identification.
• k-anonymity is not a good method to anonymize high dimensional data. For example,
researchers from MIT [12] showed that, given 4 locations, the unicity¹ of mobile phone
timestamp-location datasets can be as high as 95%.
k-anonymity is not suited to the log file problem or to checking IoCs. The k value in k-
anonymity needs to be replaced by the MaxRows that we associate with a person. So if we
are analysing network logs where a single user has 100 rows, then we would need to apply
k-anonymity with k = 100, which would probably result in nearly all data in the log being
suppressed.
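For descriptive data, verifying k-anonymity is a simple counting exercise: every combination of quasi-identifier values must appear at least k times in the release. A minimal sketch (the column names and the generalized values are illustrative assumptions):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(c >= k for c in counts.values())

# Descriptive data: one row per person, with generalized country/industry.
release = [
    {"country": "South America", "industry": "Services"},
    {"country": "South America", "industry": "Services"},
    {"country": "North America", "industry": "Health"},
    {"country": "North America", "industry": "Health"},
]
print(is_k_anonymous(release, ["country", "industry"], k=2))  # True

# With log files, one user may own 100 rows sharing the same combination,
# so k would have to exceed MaxRows to give any real protection.
```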
4 Homomorphic Encryption
Homomorphic Encryption involves doing computation on encrypted data. Microsoft in 2012 reported
a slowdown of 6-7 orders of magnitude (https://www.microsoft.com/en-us/research/wp-
content/uploads/2016/02/323.pdf). UPenn in 2016 reported a slowdown of 9 orders of magnitude
(https://haeberlen.cis.upenn.edu/papers/seabed-osdi2016.pdf). It would appear that Homomorphic
Encryption is not yet feasible for working with data at scale or for processing large log
files.
¹ Unicity is measured by the number of points needed to uniquely identify an individual in a data set.
5 Monero Style Privacy
Monero is a crypto-currency where the key features are those around privacy and anonymity:
• The value of transactions is obfuscated.
• Sending addresses are hidden in combination with other addresses (in a “ring signature”)
so it is not clear exactly who sent a transaction.
• Receiving addresses are hidden using stealth addresses which are generated using a secret
sharing scheme.
There has been a back and forth between Monero and researchers who have pointed out
privacy concerns in the approaches used by Monero. More recently (September 2020), the
United States IRS offered a USD $625,000 bounty for the development of tools to help trace
Monero and related crypto-currencies.
6 Secure Multi-party Computation / Federated Learning
The example in Section 1.0.1 highlights the problem with Federated Learning.
• A learner at Computer1 cannot distinguish between the IoC (XYZZY) and an encoded
version of the first victim (abc).
• A learner at Computer2 cannot distinguish between the IoC (XYZZY) and an encoded
version of the second victim (def).
• A learner at Computer3 cannot distinguish between the IoC (XYZZY) and an encoded
version of the third victim (ghj).
We need to merge the records from different people to identify which elements are private and
which elements are suitable as public IoCs. But the very process of merging the records breaks
the privacy that we are attempting to create.
7 An Approach for Making Log Files Private
7.1 Proposal Step 1: Rewrite Identifiers with a Ring Signature
We may have sensitive data sets where we want/need to replace a personal identifier with another
token for the purposes of clustering / pivoting / identifying IoCs / etc.
The problematic table in a log file is Table 2:
Table 2
PID Event Id Col1 . . . ColMax2
P1 Event1 . . .
P1 Event2 . . .
P1 Event3 . . .
. . . . . . . . .
Pj EventMax . . .
We replace the PID with a Ring Signature for that data row. We define a parameter R to
determine how imprecise each Ring Signature will be. The Ring Signature for EventE which
came from person Pi should be created by:
1. SetE = randomly generate a set of R − 1 people;
2. RSE = generate a ring signature for the set Pi + SetE.
This gives us the following Table:
Table 3
Ring Event Id Col1 . . . ColMax2
Signature
RS1 Event1 . . .
RS2 Event2 . . .
RS3 Event3 . . .
. . . . . . . . .
RSj EventMax . . .
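Step 1 and Step 2 of the proposal can be sketched as follows; here a ring signature is represented abstractly as the set of its members (a stand-in for a real cryptographic ring signature), and the population of 100 people and the fixed seed are assumed for the example:

```python
import random

def make_ring(person, population, R, rng=None):
    """Step 1: pick R - 1 random decoys; Step 2: form the ring Pi + SetE.

    The frozenset of member IDs stands in for a real ring signature,
    which would let anyone verify that *some* member signed the row
    without revealing which one.
    """
    rng = rng or random.Random()
    decoys = rng.sample([p for p in population if p != person], R - 1)
    return frozenset([person, *decoys])

population = [f"P{i}" for i in range(1, 101)]
ring = make_ring("P1", population, R=5, rng=random.Random(0))
print(len(ring), "P1" in ring)  # 5 True
```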
7.2 Proposal Step 2: Apply a modified k-anonymity
We now apply a modified k-anonymity procedure to Table 3. We apply a range of feature
extraction approaches (from Security or Machine Learning). Each of these methods gives us a
candidate feature, F, with a group of rows, G.
We apply the following steps to determine if F is potentially a privacy violation.
1. get the set of ring signatures for group G
2. MinPID(F) = process this set of ring signatures to determine the minimum number of
identities in the group
3. If MinPID(F) ≤ k then feature F is a privacy violation and needs to be suppressed or
deleted.
If MinPID(F) > k, then F (independent of other features) can be considered anonymous, since
in isolation we can only associate a set of at least k identities with it.
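With the set representation of ring signatures used here for illustration, a simple lower bound on MinPID is the size of a pairwise-disjoint subset: rows whose rings share no member must have been signed by distinct people. A greedy sketch (the member IDs are illustrative):

```python
def min_pid_lower_bound(rings):
    """Greedy lower bound on the number of distinct identities.

    Any collection of pairwise-disjoint rings must involve that many
    different signers, so the size of the set built here is a valid
    lower bound on MinPID(F).
    """
    chosen = []
    for ring in rings:
        if all(ring.isdisjoint(c) for c in chosen):
            chosen.append(ring)
    return len(chosen)

# Three rows whose rings all contain P1 could all be from one person.
same = [frozenset({"P1", "P2", "P3"}),
        frozenset({"P1", "P4", "P5"}),
        frozenset({"P1", "P6", "P7"})]
# Three pairwise-disjoint rings require three distinct signers.
distinct = [frozenset({"P1", "P2"}), frozenset({"P3", "P4"}),
            frozenset({"P5", "P6"})]

k = 2
print(min_pid_lower_bound(same) <= k)      # True: potential violation
print(min_pid_lower_bound(distinct) <= k)  # False: at least k + 1 identities
```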
7.3 Properties of Table 3
Table 3 is a useful table for identifying pivots and IoCs.
Let's consider the situation where we have logs from 100 people and each person has 100
events in Table 3. Let the Ring imprecision parameter R = 5. Table 3 then has 10,000 events.
Let's consider what an attacker who got the entire contents of Table 3 might do:
• They may try to extract information about a specific event. Due to the ring signature,
there are R = 5 unidentified people that it may have come from.
• They may try to extract all the events for person Pi. They would get a collection of 100
events from Pi and a collection of 400 events which were not generated by person Pi.
All they could determine is that each event had a chance of 1/R of really being from some
unidentified person.
7.4 Light Weight Ring Signatures (LWRS)
Most Ring Signature approaches create large signatures; the size of the cryptographic signature
increases linearly with the number of people (identifiers) being anonymized [13, Section
Efficiency]. This makes their use with large log files / large sets of people more difficult.
Many aspects of the above proposal can be satisfied by the following approach:
• Allocate each person a large prime (a few hundred bits);
• The ring signature for a set of people is the product of the primes for each person;
• Given two light weight ring signatures, we can determine if they have one or more people
in common by performing a greatest common divisor (GCD) operation.
If GCD(LWRS1, LWRS2) = 1, then we know that these 2 rows came from different
identities. We can do pairwise GCD calculations to show that a group of LWRS came from more
than k identities.
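A minimal sketch of the LWRS construction using Python's arbitrary-precision integers (the small primes and the prime-to-person assignment are illustrative; a real deployment would use primes of 200+ bits):

```python
import math

# Illustrative assignment of one prime per person.
PRIMES = {"Alice": 3, "Bob": 13, "Charlie": 19,
          "Dana": 5, "Eve": 7, "Frank": 11}

def lwrs(people):
    """A lightweight ring signature: the product of the members' primes."""
    return math.prod(PRIMES[p] for p in people)

sig1 = lwrs(["Alice", "Dana", "Eve"])     # 3 * 5 * 7   = 105
sig2 = lwrs(["Bob", "Charlie", "Frank"])  # 13 * 19 * 11 = 2717
sig3 = lwrs(["Bob", "Dana", "Eve"])       # 13 * 5 * 7  = 455

print(math.gcd(sig1, sig2))  # 1: no member in common, distinct signers
print(math.gcd(sig2, sig3))  # 13: Bob appears in both rings
```

Note that GCD = 1 proves the two rings are disjoint, while GCD > 1 reveals only which primes the rings share, not which member actually generated each row.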
7.5 Worked Example
We now apply the proposal to the example from Section 1.0.1.
The data:
Person Computer Event Data
------ -------- ----- ----
Alice Computer1 EventA-1 XYZZY-abc
Alice Computer1 EventA-2 XYZZY-abc
Alice Computer1 EventA-3 XYZZY-abc
...
Bob Computer2 EventB-1 XYZZY-def
Bob Computer2 EventB-2 XYZZY-def
Bob Computer2 EventB-3 XYZZY-def
...
Charlie Computer3 EventC-1 XYZZY-ghj
Charlie Computer3 EventC-2 XYZZY-ghj
Charlie Computer3 EventC-3 XYZZY-ghj
where
abc = encrypted(Alice)
def = encrypted(Bob)
ghj = encrypted(Charlie)
7.6 Step 1: Rewrite Identifiers with a Ring Signature
We assign the following primes²:
Alice 3
Bob 13
Charlie 19
We generate Light Weight Ring Signatures for each event row.
This results in an intermediate data set:
LW Ring Signature Data
----------------- ----
3 x 11 x 23 XYZZY-abc
3 x 29 x 31 XYZZY-abc
3 x 29 x 37 XYZZY-abc
...
5 x 13 x 17 XYZZY-def
13 x 19 x 53 XYZZY-def
13 x 7 x 61 XYZZY-def
...
19 x 53 x 67 XYZZY-ghj
19 x 5 x 71 XYZZY-ghj
11 x 19 x 73 XYZZY-ghj
² In this example, we use small primes, but in a real application we would use large primes with 200+ binary
digits.
7.7 Step 2: Apply a modified k-anonymity
We define the GCD of a feature:
GCD(F) = GCD(set of LWRS for Feature F)
We now evaluate the GCD for a range of features:
• “XYZZY-abc”
• “XYZZY-def”
• “XYZZY-ghj”
• “XYZZY”
• “abc”
• “def”
• “ghj”
The group of data associated with feature = “XYZZY-abc” has
GCD(“XYZZY-abc”) = GCD(3 × 11 × 23, 3 × 29 × 31, 3 × 29 × 37) = 3
and hence these data rows most likely came from a single person. Thus this feature should be
rejected.
Similarly,
GCD(“XYZZY-def”) = 13 AND GCD(“XYZZY-ghj”) = 19
and hence these strings must not be retained.
When we apply common string algorithms to the data, we also consider the strings “abc”,
“def”, “ghj” and “XYZZY”. We find that
GCD(“abc”) = 3 AND GCD(“def”) = 13 AND GCD(“ghj”) = 19
so these strings must not be retained. We find
GCD(“XYZZY”) = 1
so this feature can be used: we know it comes from multiple people.
The final transformed data set is:
LW Ring Signature Data
----------------- ----
3 x 11 x 23 XYZZY
3 x 29 x 31 XYZZY
3 x 29 x 37 XYZZY
...
5 x 13 x 17 XYZZY
13 x 19 x 53 XYZZY
13 x 7 x 61 XYZZY
...
19 x 53 x 67 XYZZY
19 x 5 x 71 XYZZY
11 x 19 x 73 XYZZY
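The whole worked example can be reproduced with a few lines of integer arithmetic. The LWRS values below follow the intermediate data set (Alice = 3, Bob = 13, Charlie = 19; the decoy primes are one illustrative choice), and a feature is kept only when its GCD is 1:

```python
import math
from functools import reduce

# (LWRS, data) pairs from the intermediate data set.
rows = [
    (3 * 11 * 23, "XYZZY-abc"), (3 * 29 * 31, "XYZZY-abc"), (3 * 29 * 37, "XYZZY-abc"),
    (5 * 13 * 17, "XYZZY-def"), (13 * 19 * 53, "XYZZY-def"), (13 * 7 * 61, "XYZZY-def"),
    (19 * 53 * 67, "XYZZY-ghj"), (19 * 5 * 71, "XYZZY-ghj"), (11 * 19 * 73, "XYZZY-ghj"),
]

def feature_gcd(feature):
    """GCD(F) = GCD of the LWRS of every row whose data matches feature F."""
    sigs = [sig for sig, data in rows if feature in data]
    return reduce(math.gcd, sigs)

for f in ["XYZZY-abc", "XYZZY-def", "XYZZY-ghj", "abc", "def", "ghj", "XYZZY"]:
    verdict = "keep" if feature_gcd(f) == 1 else "suppress"
    print(f, feature_gcd(f), verdict)
```

Only "XYZZY" survives, matching the final transformed data set above.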
8 Conclusion
We have considered applying a range of privacy solutions to log files. We found that methods
such as differential privacy and k-anonymity are not suitable for log files. We make a proposal
that replaces personal identifiers with ring signatures when collecting log files. In particular, we
offer a lightweight ring signature proposal which significantly improves privacy when collecting
log files while allowing those log files to be processed for tasks such as identifying IoCs.
References
[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in
private data analysis,” in Theory of cryptography conference. Springer, 2006, pp. 265–284,
https://link.springer.com/content/pdf/10.1007/11681878_14.pdf.
[2] “Differential privacy,” https://en.wikipedia.org/wiki/Differential_privacy, [Online; accessed
17-May-2020].
[3] P. Samarati and L. Sweeney, “Protecting privacy when disclosing information:
k-anonymity and its enforcement through generalization and suppression,” 1998,
https://dataprivacylab.org/dataprivacy/projects/kanonymity/paper3.pdf.
[4] “K-anonymity,” https://en.wikipedia.org/wiki/K-anonymity, [Online; accessed 17-May-
2020].
[5] C. Gentry, “Fully homomorphic encryption using ideal lattices,” in Proceedings of the forty-
first annual ACM symposium on Theory of computing, 2009, pp. 169–178.
[6] “Homomorphic encryption,” https://en.wikipedia.org/wiki/Homomorphic_encryption,
[Online; accessed 17-May-2020].
[7] “Monero,” https://en.wikipedia.org/wiki/Monero, [Online; accessed 17-May-2020].
[8] A. C. Yao, “Protocols for secure computations,” in 23rd annual symposium on foundations
of computer science (sfcs 1982). IEEE, 1982, pp. 160–164.
[9] “Secure multi-party computation,” https://en.wikipedia.org/wiki/Secure_multi-party_computation,
[Online; accessed 17-May-2020].
[10] “Federated learning,” https://en.wikipedia.org/wiki/Federated_learning, [Online; accessed
17-May-2020].
[11] “Secret sharing,” https://en.wikipedia.org/wiki/Secret_sharing, [Online; accessed 17-May-
2020].
[12] Y.-A. De Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, “Unique in the crowd:
The privacy bounds of human mobility,” Scientific reports, vol. 3, no. 1, pp. 1–5, 2013,
https://www.nature.com/articles/srep01376.
[13] “Ring signature,” https://en.wikipedia.org/wiki/Ring_signature, [Online; accessed 17-May-
2020].