On Improving the Performance of Data Leak Prevention using White-list Approach
An-Trung B. Nguyen1
Faculty of Information Technology,
University of Science Hochiminh City,
Vietnam
(84)976205232
1012482@student.hcmus.edu.vn
Trung H. Dinh2
Faculty of Information Technology,
University of Science Hochiminh City,
Vietnam
(84)988913415
1012481@student.hcmus.edu.vn
Dung T. Tran3
Faculty of Information Technology,
University of Science Hochiminh City,
Vietnam
(84)932408618
ttdung@fit.hcmus.edu.vn
ABSTRACT
Fang Hao et al. [1] proposed a Data Leakage Prevention (DLP) model that combines a 128-bit CRC (Cyclic Redundancy Check) with a Bloom filter [2]. In this paper, we improve on their work by combining general-purpose hash functions with the Bloom filter. The experimental results show that our approach significantly improves the system's throughput. In our experiments, we used a 9.3 GB data set of more than 121,000 HTML files.
Keywords
Data leak prevention, bloom filter, fingerprint, hash function.
1. INTRODUCTION
In recent years, the trend of storing data in the cloud has grown quickly because of the convenience this kind of service offers. Beyond the significant advantages, however, there is an important problem that users and service providers have to consider: the safety of the data. In the cloud environment, malicious users are always ready to hijack or steal sensitive data for a wide variety of purposes. As a result, cloud service providers are constantly looking for new technologies that raise the level of security and make such activity more difficult or impossible.
Figure 1. The number of incidents over time (datalossdb.org)
As shown in Fig. 1, taken from DatalossDB [3], there were over 115 data leakage incidents each month during 2013. Some of these incidents affected millions of users. Such leakages happened despite protection mechanisms such as firewalls and IDS/IPS devices. Current protection tools appear unable to prevent zero-day attacks, application-level compromises, bugs, or misconfigurations. Moreover, most incidents come from the outside (Fig. 2). This demands new approaches to protecting data in cloud storage.
Figure 2. Incident source
As mentioned earlier, the DLP model proposed by Fang Hao et al. [1] to prevent data leakage builds a fingerprint database sitting at the border of the network: all network traffic has to be inspected, and if sensitive data is found, the traffic is stopped to minimize the number of leaked bytes. Fang Hao et al. [1] used a 128-bit CRC to create the fingerprints. This approach is simple but not efficient in terms of system performance, since building the database requires system resources and every data leak check needs to re-run the CRC. We propose an alternative approach that improves the fingerprint creation process while still keeping the fingerprint collision rate low.
Checking every single byte of outgoing files is impractical in terms of memory usage and computational cost, so outgoing files should be checked only at some sample points, say t1 < t2 < t3 < … < tk. These checking points are determined at the beginning. Based on them, we create a fingerprint for every segment of every file in the database and store the fingerprints in a Bloom filter [2]. When a user tries to download a file, the DLP divides the file into parts according to the checking points, calculates their fingerprints, and compares them with the ones stored in the Bloom filter. If any mismatch occurs, the file is not on the white list, and the transmission is immediately terminated.
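The checkpoint-based check just described can be sketched as follows. This is a minimal illustration, not the deployed system: the checkpoint offsets and MD5 are placeholder choices, and a plain Python set stands in for the Bloom filter.

```python
import hashlib

# Hypothetical checkpoint offsets t1 < t2 < t3 (byte positions);
# in the real system they come from the placement algorithm of [1].
CHECKPOINTS = [4, 8, 16]

def segment_fingerprints(data: bytes):
    """Split the file at the checkpoints (plus the tail) and
    fingerprint each segment; MD5 is an illustrative stand-in."""
    prints, start = [], 0
    for t in CHECKPOINTS + [len(data)]:
        prints.append(hashlib.md5(data[start:t]).hexdigest())
        start = t
    return prints

def is_whitelisted(data: bytes, db) -> bool:
    """Stop at the first segment whose fingerprint is not in the
    white-list database (here, a set instead of a Bloom filter)."""
    for fp in segment_fingerprints(data):
        if fp not in db:
            return False
    return True

# Build the fingerprint "database" from a known good file.
good = b"confidential report 2014"
db = set(segment_fingerprints(good))
tampered = b"Confidential report 2014"   # one changed byte
```

Here `is_whitelisted(good, db)` passes, while the tampered copy fails at the first checkpoint, so the transmission would be terminated after at most a few leaked bytes.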
The initial step is one of the most important steps in our proposal. Since the Bloom filter is a probabilistic data structure, the size of the filter is directly related to the false positive probability. On the other hand, the collision probability of the hash functions is also an important factor for performance. We show how this is so in the experiment section.
In this work, we conduct experiments on five widely used hash functions from two popular groups: cryptographic and non-cryptographic hash functions. We use the system's throughput and the percentage of leaked files to evaluate the hash functions.
This paper has six sections. Section 1, the Introduction, sets out the purpose of the study, defines the standards used to measure the results, and touches on similar studies. We summarize related work in Section 2. In Section 3 we present the Bloom filter, and in Section 4 the hash functions. We present the implementation and experimental results in Section 5. Our conclusions and discussion are presented in Section 6.
2. RELATED WORK
MyDLP [4] and IronPort [5] use a similar approach, a network-based prevention model. They apply the two following techniques:
- Use a black-list to detect data leaks.
- Use keywords or regular expressions to detect the same.
For instance, WebDLP [6,7] and IronPort [5] use similarity-based content classifiers to identify content and then compare it with the confidential black-list contents. MyDLP [4] also uses a black-list combined with keywords, regular expressions, and full-file hashes to prevent data leaks. These approaches are not foolproof, since an attacker can encrypt or transform content to evade the black-list. False positives can also be high, because valid documents may contain some of the same keywords.
Conversely, Fang Hao et al. [1] proposed a different approach: extract fingerprints from each document in a white-list, and build a fingerprint database stored at the network boundary. For each data transmission connection, they inspect the outgoing data, calculate its fingerprints, and check them against the database. If the fingerprint at any checkpoint of the transmission does not match the database, the transmission is terminated. To reduce memory overhead, Fang Hao et al. [1] proposed an algorithm that optimizes the checkpoint locations while bounding the worst-case data leakage length.
Figure 3. Adding an element to the Bloom filter
Building on the work of Fang Hao et al. [1], we improved the processing speed at the fingerprint creation step. Our approach is faster but equally accurate. We used several different hash functions to generate document fingerprints, aiming to find a hash function with high speed and a low collision rate; in other words, to create fingerprints faster while still ensuring their accuracy.
3. BLOOM FILTER
A Bloom filter is a data structure designed to tell you, rapidly and with low memory overhead, whether an item is present in a set. The price paid for this efficiency is that a Bloom filter is probabilistic: if it tells us that an element is not in the set, the element is definitely not in the set; but if it tells us that an element is in the set, the element may or may not actually be there, with a false positive probability that depends on the number of bits in the Bloom filter.
In general, to add an item to the set, we use several hash functions to calculate hash values of the item. These hash values are then used as bit indices into the Bloom filter.
More precisely, the idea is to allocate an array v of m bits, initially all set to 0, and then choose k independent hash functions, {h1, h2, ..., hk}, each with range {1, ..., m}. For each element a ∈ A, the bits at positions h1(a), h2(a), ..., hk(a) in v are set to 1; a particular bit may be set to 1 multiple times. Given a query for b, to check membership we examine the bits at positions h1(b), h2(b), ..., hk(b). If any of them is 0, then b is certainly not in the set A. Otherwise we conjecture that b is in the set, although there is a certain probability that we are wrong. This is called a "false positive". The parameters k and m should be chosen such that the probability of a false positive (and hence a false hit) is acceptable.
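The add and membership-check procedures can be sketched as below. This is a minimal sketch: the k hash functions are derived from MD5 with integer salts purely for illustration; they are not the functions benchmarked later in the paper.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions derived
    from MD5 with distinct integer salts (an illustrative choice)."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: bytes):
        # One bit position per hash function h1..hk.
        for i in range(self.k):
            digest = hashlib.md5(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos] = 1    # a bit may be set more than once

    def __contains__(self, item: bytes) -> bool:
        # Any zero bit means the item is definitely not in the set.
        return all(self.bits[pos] for pos in self._positions(item))
```

Since added items set all of their k bits, lookups for inserted items never fail; only non-members can be (falsely) reported present.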
Figure 4. Checking membership of an item
The salient feature of Bloom filters is that there is a clear trade-off between m and the probability of a false positive. According to [8,9], after inserting n keys into a table of size m using k hash functions, the probability that a particular bit is still 0 is:

p0 = (1 − 1/m)^(kn) ≈ e^(−kn/m)    (1)

Hence, the probability of a false positive in this situation is:

f = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−kn/m))^k    (2)

The right-hand side of (2) is minimized when k = (m/n) ln 2, which implies:

f = (1/2)^k ≈ (0.6185)^(m/n)    (3)
In fact, k must be an integer, and in practice we might choose a value less than the optimal one to reduce memory usage and hashing work. To obtain the expected false positive probability, we have to consider the trade-off between the Bloom filter's size and the false positive probability. In other words, a system with more physical memory can achieve a better false positive rate.
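A small helper for picking an integer k and evaluating the resulting false positive rate, directly from the formulas above (the filter size m and item count n below are arbitrary illustrative values):

```python
import math

def optimal_k(m: int, n: int) -> int:
    """k* = (m/n) ln 2, rounded down to an integer; as noted in the
    text, a smaller k also reduces memory accesses and hashing work."""
    return max(1, math.floor((m / n) * math.log(2)))

def false_positive(m: int, n: int, k: int) -> float:
    """Approximate false positive probability (1 - e^(-kn/m))^k, eq. (2)."""
    return (1.0 - math.exp(-k * n / m)) ** k

m, n = 1 << 20, 100_000      # illustrative: 1 Mbit filter, 100k items
k = optimal_k(m, n)          # -> 7
```

With these illustrative numbers the filter gives a false positive probability a little under 1%, slightly worse than (1/2)^k because k was rounded down from its optimum.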
In our approach, the index into the Bloom filter is calculated from the hash value of an item, where an item is the segment between two checking points of a document. A hash collision occurs when two different items yield the same hash value. In this paper, we use experiments to find the hash function that gives both high speed and a low collision rate. The next section analyzes the collision behavior of hash functions.
4. HASH FUNCTION
4.1 Collision probabilities
Consider a hash function with N possible hash codes, and pick one value at random; any of the N − 1 remaining hash codes is distinct from it. Therefore, the probability of randomly generating two distinct hash codes out of N possible codes is (N − 1)/N. Consequently, the probability of randomly generating three distinct hash codes is ((N − 1)/N) · ((N − 2)/N). Since generating a random hash code is an independent event, the probability of generating k distinct hash codes is the product of the probabilities of each single step.
In general, the probability of generating k distinct hash codes is:

p = ((N − 1)/N) · ((N − 2)/N) · … · ((N − k + 1)/N)    (4)

This is computationally difficult to evaluate for large k. Fortunately, equation (4) can be approximated by:

p ≈ e^(−k(k−1)/(2N))    (5)

The probability that at least one collision occurs among the k hash codes is therefore 1 − p.
Figure 5 illustrates the collision probability when using 32-bit hash codes. It is worth noting that a collision occurs with probability 1/2 when the number of hashes (x-axis) is around 77,000. We also note that the graph takes the same S-curved shape for any value of N. Since the hash code serves as the index into the Bloom filter, the hash function must have two properties: it must be fast and it must have a low collision rate.
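The approximation (5) makes the curve in Figure 5 easy to reproduce; a sketch with N = 2^32 hash codes, as in the figure:

```python
import math

def collision_probability(k: int, n_bits: int = 32) -> float:
    """P(at least one collision among k random hash codes), using the
    birthday approximation 1 - e^(-k(k-1)/(2N)) with N = 2^n_bits."""
    N = 2 ** n_bits
    return 1.0 - math.exp(-k * (k - 1) / (2 * N))

# Around 77,000 hashes the collision probability crosses 1/2,
# matching the S-curve of Figure 5.
```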
4.2 Hash function selection
The fingerprint has to be long enough to make the probability of two different items having the same fingerprint infinitesimally small. For practical applications, a 128-bit or 256-bit fingerprint is sufficient. In our experiments, we examine two types of hash functions: cryptographic and non-cryptographic.
Figure 5. Collision of 32-bit Hash function
5. IMPLEMENTATION AND EXPERIMENTS
The implementation of our approach consists of three steps:
- Preprocessing
- Filter construction
- On-line checking
5.1 Preprocessing
We first examine the existing file set and generate the file size distribution. Then we run the dynamic programming algorithm described in [1] to compute the optimal filter placement strategy, including the filter locations, the number of hash functions, and the filter sizes. There is a straightforward mapping between the expected false positive probability and the total amount of memory needed for the Bloom filter.
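The mapping between target false positive probability and memory budget can be sketched with the standard Bloom filter sizing formula m = −n ln p / (ln 2)^2, a textbook result consistent with equations (2)-(3) rather than a formula taken from [1]:

```python
import math

def bits_needed(n: int, p: float) -> int:
    """Bits required to hold n items at false positive probability p,
    assuming the optimal k = (m/n) ln 2 is used."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

# e.g. 1000 items at p = 1% cost roughly 9.6 bits per item.
```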
5.2 Filter construction
In this step, we construct the actual Bloom filter based on the optimal filter parameters computed in the previous step. In general, Bloom filters need multiple hash functions. This can be achieved by appending multiple random numbers to the end of the data and computing a new hash value corresponding to each one. We then take the hash value mod m, the number of bits in the filter, to map each hash value to a bit index in the Bloom filter. We repeat this process for all files.
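The salting trick described above, simulating several hash functions by appending different random numbers to the data, might look like this (the salt count, MD5, and filter size are illustrative assumptions):

```python
import hashlib
import random

# Per-filter salts: the random numbers appended to the data, drawn once
# at construction time (illustrative; the paper does not fix a count).
random.seed(7)
SALTS = [random.getrandbits(32).to_bytes(4, "big") for _ in range(4)]
M = 1 << 16   # filter size in bits, illustrative

def bit_indices(segment: bytes):
    """One bit index per simulated hash function: hash(data || salt_i) mod m."""
    return [int.from_bytes(hashlib.md5(segment + s).digest()[:8], "big") % M
            for s in SALTS]
```

Because the salts are fixed at construction time, the same segment always maps to the same k bit positions, which is what filter construction and on-line checking both rely on.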
5.3 On-line checking
Once the filters are constructed, the system is ready to launch. In this step, data traffic is processed and hash checksums are continuously computed. The Bloom filter check is then performed at each pre-determined location. A data flow is allowed to continue if the Bloom filter reports a hit; otherwise the flow is dropped. To minimize the leakage of small files, we mandate a check at the end of the data flow, before releasing the last packet. In this way, small files contained in a single packet can always be checked.
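The on-line checking step can be sketched as a small streaming loop. The checkpoint offsets are hypothetical, MD5 stands in for the benchmarked hashes, and a plain set stands in for the Bloom filter lookup:

```python
import hashlib

# Hypothetical pre-determined checkpoint offsets (byte positions);
# the real locations come from the placement algorithm of [1].
CHECKPOINTS = [8, 24]

def segments(data: bytes):
    """Split a complete file at the checkpoints (plus the tail)."""
    out, start = [], 0
    for t in CHECKPOINTS:
        out.append(data[start:t])
        start = t
    out.append(data[start:])
    return out

def stream_check(packets, allowed) -> bool:
    """Process a flow packet by packet. Fingerprint each segment as soon
    as its checkpoint is reached; drop the flow (False) on a miss. A
    mandatory final check covers the tail before the last packet leaves."""
    buf, consumed, cps = b"", 0, list(CHECKPOINTS)
    for pkt in packets:
        buf += pkt
        while cps and consumed + len(buf) >= cps[0]:
            cut = cps.pop(0) - consumed
            seg, buf = buf[:cut], buf[cut:]
            consumed += cut
            if hashlib.md5(seg).hexdigest() not in allowed:
                return False
    if buf and hashlib.md5(buf).hexdigest() not in allowed:
        return False
    return True
```

Note that a checkpoint can fall in the middle of a packet, so the loop buffers bytes and cuts exactly at each offset; the end-of-flow check handles small files that never reach the first checkpoint.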
5.4 Experiment scenarios
We used 9.3 GB of data to evaluate our algorithm. To collect the data, we used the wget command in Linux to download files from the Internet. In this paper, we crawled the following web pages:
- www.tools.ietf.org
- www.tuoitre.vn
- www.bbc.co.uk
- www.nld.com.vn
These are news and scientific pages. The downloaded files were classified by file type and size. To automatically download a whole site, we used the following command:
wget -x --convert-links -r <address of website>
In the experiments with the Bloom filter, data files were processed
following the three implementation steps described in the previous
sections. At the last step, to simulate data leakage, we randomly
selected 1000 files, and then randomly selected one byte within
each file as the “bad byte” to change its value. For each file, we
computed the incremental hash value up to each Bloom filter
location, and checked if such fingerprints were present at the
location. The bad byte was detected when the check failed. The
difference between the detection location and the location of the "bad byte" is the number of bytes that are leaked; this is called the detection lag.
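The leak simulation just described can be sketched as follows, with hypothetical checkpoint locations and MD5 as a stand-in fingerprint (a set again stands in for the Bloom filter):

```python
import hashlib

CHECKPOINTS = [100, 300, 600]   # hypothetical filter locations (bytes)

def allowed_set(data: bytes):
    """Fingerprints of the original file's segments (the white-list)."""
    fps, start = set(), 0
    for t in CHECKPOINTS + [len(data)]:
        fps.add(hashlib.md5(data[start:t]).hexdigest())
        start = t
    return fps

def detection_lag(data: bytes, bad_pos: int, allowed):
    """Flip one byte at bad_pos, then check segment by segment; the lag
    is the detection location minus bad_pos, as defined in the text."""
    leaked = bytearray(data)
    leaked[bad_pos] ^= 0xFF          # the "bad byte"
    start = 0
    for t in CHECKPOINTS + [len(data)]:
        if hashlib.md5(bytes(leaked[start:t])).hexdigest() not in allowed:
            return t - bad_pos       # detected at checkpoint t
        start = t
    return None   # slipped through (a Bloom filter false positive)
```

A bad byte is always caught at the next checkpoint after its position (barring a false positive), so the detection lag is bounded by the spacing of the filter locations.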
5.5 Experimental results
First, we used five different hash functions to calculate all the fingerprints for the same HTML file set. In these experiments, the cryptographic hash functions such as SHA1 [10] and MD5 [11] took more time than the non-cryptographic ones such as Murmur3 [12], JenkinsHash [13], FNVHash [14] and CityHash [15]. The results show that the number of leaked files was around 4-5 out of 1000 randomly selected files, i.e., the false positive probability is approximately 0.5 percent.
Fig. 6 compares the hash functions' speeds when hashing all of the data files; the y-axis is the total time needed to create all the fingerprints. This result shows that CityHash is the fastest hash function: it is faster than CRC128 by approximately 200 seconds, while CRC128, SHA1 and MD5 each took around 1500 seconds or more in total. JenkinsHash was as fast as FNVHash.
According to the above results, the non-cryptographic hash functions are faster than the cryptographic ones. Although CRC-128 is also a non-cryptographic hash function, it is slower than SHA1 and MD5, the cryptographic hash functions. Moreover, CRC-128 and non-cryptographic hash functions such as JenkinsHash, FNVHash and CityHash have approximately the same collision probability. As a result, we can use CityHash, FNVHash or JenkinsHash instead of CRC-128 to speed up our system.
Figure 6. Speed of hash functions
Figure 7. Leaked files
Fig. 7 shows the number of leaked files when we checked 1000 random files. The number of leaked files varies from 4 to 7. The Bloom filter approach works much better than the Fingerprint Comparison approach (FCA) [1] because it saves memory and CPU processing. In our experiments, once the fingerprints for all the files were created, the Bloom filter required just 5.12 MB for an expected detection lag of 1000 bytes. In comparison, the Bloom filter needed 50 times less memory than the FCA. This is because the FCA has to store each fingerprint in full instead of just setting a single bit per hash, as the Bloom filter does; the Bloom filter also does not need to compare whole fingerprints when checking an incoming string. As a result, throughput improves when users download documents from our database.
Fig. 8 shows the average detection lag when different hash functions are used to create the fingerprints. The results show that the detection lags do not differ much. This means that using CityHash can improve the system's throughput while keeping a comparable leaking rate.
Figure 8. Average detection lag
6. CONCLUSION AND DISCUSSION
In this paper, we applied non-cryptographic hash functions to create fingerprints for documents, which are used to detect and prevent data leaks. Our experimental results show that the proposed method improves the system's throughput while keeping the same level of data leakage as the approach in [1].
In our experiments, we used a single-core CPU with limited RAM (4 GB) because of a shortage of resources. In a practical cloud environment, the results could be far better, since the available RAM would be much larger; the system could use this advantage to reduce the false positive rate. We also expect a significant improvement if this approach is deployed on a multi-core CPU system.
7. REFERENCES
[1] Fang Hao, Kodialam, M., Lakshman, T.V., Puttaswamy,
K.P.N., "Protecting cloud data using dynamic inline fingerprint
checks," INFOCOM, 2013 Proceedings IEEE , vol., no.,
pp.2877,2885, 14-19, April 2013.
[2] Burton H. Bloom. “Space/time trade-offs in hash coding with
allowable errors”. Communications of the ACM 13, 422-426, July
1970.
[3] Data Loss DB http://www.datalossdb.org/statistics
[4] Mydlp - data leak prevention, http://www.mydlp.com.
[5] Cisco, “Cisco ironport data loss prevention,”
http://www.ironport.com/kr/technology/ironport dlp
overview.html.
[6] S. Yoshihama, T. Mishina, and T. Matsumoto, “Web-based
data leakage prevention,” in Proceedings of IWSEC, 2010.
[7] http://www.websense.com/content/home.aspx - WebDLP
[8] Sergey Butakov, “Using Bloom Filters in Data Leak
Protection Applications.” DART@AI*IA, volume 1109 of CEUR
Workshop Proceedings, page 13-24. CEUR-WS.org, 2013.
[9] Fang Hao, Murali Kodialam, and T. V. Lakshman. “Building
high accuracy bloom filters using partitioned hashing”. In
Proceedings of ACM SIGMETRICS '07, New York, USA, 2007.
[10] http://tools.ietf.org/html/rfc3174 - SHA Algorithm
[11] http://www.ietf.org/rfc/rfc1321.txt - MD5 Algorithm
[12] https://code.google.com/p/smhasher/ - Murmur 3
[13] http://burtleburtle.net/bob/hash/doobs.html - Jenkins Hash
[14] https://tools.ietf.org/html/draft-eastlake-fnv-07 - FNVHash Algorithm
[15] https://code.google.com/p/cityhash/ - CityHash Algorithm