On Improving the Performance of Data Leak Prevention using
White-list Approach
An-Trung B. Nguyen1
Faculty of Information Technology,
University of Science Hochiminh City,
Vietnam
(84)976205232
1012482@student.hcmus.edu.vn
Trung H. Dinh2
Faculty of Information Technology,
University of Science Hochiminh City,
Vietnam
(84)988913415
1012481@student.hcmus.edu.vn
Dung T. Tran3
Faculty of Information Technology,
University of Science Hochiminh City,
Vietnam
(84)932408618
ttdung@fit.hcmus.edu.vn
ABSTRACT
Fang Hao et al. [1] proposed a Data Leakage Prevention (DLP)
model that combines a 128-bit CRC (Cyclic Redundancy Check)
with the Bloom filter [2]. In this paper, we improve on their work
by combining general-purpose hash functions with the Bloom
filter. Experimental results show that our approach significantly
improves the system's throughput. In our experiments, we used a
9.3 GB data set of more than 121,000 HTML files.
Keywords
Data leak prevention, bloom filter, fingerprint, hash function.
1. INTRODUCTION
In recent years, the trend of storing data on the cloud has been
growing fast because of the convenience which comes from this
kind of service. Looking beyond the significant advantages, there
is an important problem that users and service providers have to
consider: the safety of the data. In the cloud environment,
malicious users are always ready to hijack or steal sensitive data
for a wide variety of purposes. As a result, cloud service providers
are constantly looking for new technologies that make such
activity more difficult or impossible.
Figure 1. The number of incidents over time (datalossdb.org)
As shown in fig. 1 – taken from DatalossDB [3] – there were over
115 data leakage incidents each month during 2013. Some of
these incidents affected millions of users. Such leakages happened
despite protection mechanisms such as firewalls and IPS/IDS
devices. Current protection tools seem to fail against zero-day
attacks, application-level compromises, bugs, and
misconfigurations. Moreover, most incidents originate from
outside the organization. This demands new approaches to
protecting data in cloud storage.
Figure 2. Incident source
As we mentioned earlier, the DLP model – proposed by Fang Hao
[1] in order to prevent the data leakage – builds a fingerprint
database sitting at the border of the networks: all network traffic
has to be inspected. If there is sensitive data, the traffic will be
stopped to minimize the number of leaked bytes. Fang Hao [1]
used a 128-bit CRC to create the fingerprint. This approach is
simple but inefficient, since building the database consumes
system resources and every leak check has to re-run the CRC. We
propose an alternative approach that speeds up fingerprint creation
while still keeping the fingerprint collision rate low.
Checking every single byte of outgoing files is impractical in
terms of memory usage and computational cost, so outgoing files
should be checked only at sample points, say
$t_1 < t_2 < \dots < t_k$. These checking points are determined in
advance. Based on them, we create a fingerprint for every segment
of every file in the database and store the fingerprints in a Bloom
filter [2]. When a user tries to download a file, the DLP divides
the file into parts according to the checking points, calculates
their fingerprints, and compares them with the ones stored in the
Bloom filter. If any mismatch occurs, the file is not on the white
list and the transmission is immediately terminated.
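To make the checkpoint mechanism concrete, the following sketch shows one way it could work. The checkpoint offsets, the use of MD5, and the plain set standing in for the Bloom filter of Section 3 are all illustrative assumptions, not the implementation itself.

```python
import hashlib

CHECKPOINTS = [1024, 4096, 16384]  # hypothetical byte offsets t1 < t2 < t3

def prefix_fingerprints(data: bytes) -> list:
    """Fingerprint of each prefix ending at a checkpoint. Hashing is
    incremental, so each checkpoint reuses the work of the previous one."""
    fps, h, prev = [], hashlib.md5(), 0
    for t in CHECKPOINTS:
        if t > len(data):
            break
        h.update(data[prev:t])   # extend the running hash up to checkpoint t
        fps.append(h.digest())
        prev = t
    return fps

def transfer_allowed(data: bytes, white_list: set) -> bool:
    """Allow the download only if every checkpoint fingerprint is known
    (a plain set stands in for the Bloom filter here)."""
    return all(fp in white_list for fp in prefix_fingerprints(data))
```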
The initialization step is one of the most important steps in our
proposal. Since the Bloom filter is a probabilistic data structure,
the size of the filter directly determines the false positive
probability. On the other hand, the collision probability of the
hash functions also has a significant impact on performance. We
show how in the experiment section.
In this work, we experiment with five modern hash functions
from two popular groups: cryptographic and non-cryptographic
hash functions. We use the system's throughput and the
percentage of leaked files to evaluate them.
This paper has six sections. Section 1, this Introduction, states the
purpose of the study, defines the criteria used to measure the
results, and touches on similar studies. We summarize related
work in Section 2. Section 3 presents the Bloom filter, and
Section 4 the hash functions. We present the implementation and
experimental results in Section 5. Our conclusions, with
discussion, are presented in Section 6.
2. RELATED WORK
MyDLP [4] and IronPort [5] use a similar approach: a
network-based prevention model. They apply the following two
techniques:
- Use a black-list to detect data leaks.
- Use keywords or regular expressions to detect the same.
For instance, WebDLP [6,7] and IronPort [5] use similarity-based
content classifiers to identify content and then compare it with the
confidential black-list contents. MyDLP [4] also combines its
black-list with keywords, regular expressions, and full-file hashes.
These approaches are not foolproof, since an attacker can encrypt
or transform content to evade the black-list. False positives can
also be high, because legitimate documents may contain some of
the same keywords.
Conversely, Fang Hao et al. [1] proposed a different approach:
extract fingerprints from each document in the white-list and
build a fingerprint database stored at the network boundary. For
each data transmission connection, they inspect the outgoing data,
calculate its fingerprints, and check them against the database. If
the fingerprint at any checkpoint of the transmission does not
match the database, the transmission is terminated. To reduce
memory overhead, Fang Hao [1] proposed an algorithm that
optimizes the checkpoints' locations while bounding the
worst-case data leakage length.
Figure 3. Adding an element to a Bloom filter
Building on the work of Fang Hao et al. [1], we improved the
processing speed of the fingerprint creation step. Our approach is
faster but equally accurate. We used several different hash
functions to generate document fingerprints, aiming to find one
that is fast and has a low collision rate. In other words, we create
fingerprints faster while still ensuring their accuracy.
3. BLOOM FILTER
A Bloom filter is a data structure designed to tell you, rapidly and
with low memory overhead, whether an item is present in a set.
The price paid for this efficiency is that a Bloom filter is a
probabilistic data structure: if it tells us that the element is not in
set, this element will not be definitely in the set. By contrast, this
element may be or not may be in the set with probability depends
on number of bits of the Bloom filter.
In general, to add an item to the set, we use some hash functions
to calculate the hash values of the item. After that, these hash
values will be the index values used as the input to the bloom
filter.
More precisely, we allocate an array v of m bits, initially all set to
0, and choose k independent hash functions {h1, h2, ..., hk}, each
with range (1...m). For each element a ∈ A, the bits at positions
h1(a), h2(a), ..., hk(a) in v are set to 1; a particular bit may be set
to 1 multiple times. Given a query for b, to check membership we
examine the bits at positions h1(b), h2(b), ..., hk(b). If any of
them is 0, then b is certainly not in the set A. Otherwise we
conjecture that b is in the set, although there is a certain
probability that we are wrong; this is called a "false positive". The
parameters k and m should be chosen so that the probability of a
false positive is acceptable.
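A minimal Python sketch of these two operations follows. Deriving the k indices from a single SHA-256 digest by double hashing is a common trick and our own assumption here; any k independent hash functions would do (the scheme actually used in our system is described in Section 5.2).

```python
import hashlib

class BloomFilter:
    """An m-bit array queried through k hash functions, as described above."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k indices h1(item), ..., hk(item) via double hashing.
        d = hashlib.sha256(item).digest()
        h1 = int.from_bytes(d[:16], "big")
        h2 = int.from_bytes(d[16:], "big") | 1   # odd step spreads indices
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)    # a bit may be set repeatedly

    def __contains__(self, item: bytes) -> bool:
        # Any 0 bit proves absence; all 1s only suggests membership.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))
```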
Figure 4. Checking membership of an item
The salient feature of Bloom filters is the clear trade-off between
m and the probability of a false positive. According to [8,9], after
inserting n keys into a table of size m, the probability that a
particular bit is still 0 is

$\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$   (1)

Hence, the probability of a false positive in this situation is

$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$   (2)

The right-hand side of (2) is minimized when $k = \frac{m}{n}\ln 2$,
which implies

$\left(\frac{1}{2}\right)^{k} \approx (0.6185)^{m/n}$   (3)

In practice k must be an integer, and we might choose a value
below the optimum to reduce the number of hash computations.
To obtain the expected false positive rate, we have to consider the
trade-off between the Bloom filter's size and the false positive
probability. In other words, a system with more physical memory
can achieve a lower false positive probability.
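As a worked example of this trade-off, the snippet below evaluates the standard dimensioning formulas that follow from (2) and (3); this is a textbook derivation, not code from [1], and the example numbers are our own.

```python
import math

def dimension_filter(n: int, p: float):
    """Return (m, k) for n keys at target false positive rate p,
    using m = -n ln p / (ln 2)^2 and k = (m / n) ln 2."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# Example: 1,000,000 segments at p = 0.005 need about 11.0 Mbit
# (roughly 1.4 MB) and k = 8 hash functions.
print(dimension_filter(1_000_000, 0.005))
```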
In our approach, the index into the Bloom filter is calculated from
the hash value of an item, where an item is the segment between
two checking points of a document. A hash collision occurs when
two different items yield the same hash value. In this paper, we
use experiments to find the hash function that offers both high
speed and a low collision rate. The next section analyzes the
collision behavior of hash functions.
4. HASH FUNCTION
4.1 Collision probabilities
Given a set of N possible hash codes, pick one value. Any of the
N−1 remaining hash codes is distinct from it, so the probability of
randomly generating two distinct hash codes is $\frac{N-1}{N}$.
Consequently, the probability of randomly generating three
distinct hash codes is $\frac{N-1}{N}\cdot\frac{N-2}{N}$. Since
each hash code is generated independently, the probability of
generating k distinct hash codes is the product of the probabilities
that each new code differs from all earlier ones. In general, the
probability of generating k distinct hash codes is:

$P(\text{all distinct}) = \prod_{i=1}^{k-1}\left(1 - \frac{i}{N}\right)$   (4)

This is computationally difficult to evaluate for large k.
Fortunately, equation (4) can be approximated by:

$P(\text{all distinct}) \approx e^{-k(k-1)/(2N)}$   (5)
Figure 5 illustrates the probability of collision when using 32-bit
hash codes. It is worth noting that a collision occurs with
probability 1/2 when the number of hashes (x-axis) is around
77,000. We also note that the graph takes the same S-curve shape
for any value of N. Since the hash code will serve as the index
into the Bloom filter, the hash function must have two properties:
it must be fast and it must produce few collisions.
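The 77,000 figure can be reproduced directly from approximation (5); the small check below is our own computation.

```python
import math

def collision_probability(k: int, n_codes: int = 2 ** 32) -> float:
    """P(at least one collision among k random 32-bit hash codes), via (5)."""
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * n_codes))

print(collision_probability(77_163))   # ~0.5, the midpoint of the S-curve
```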
4.2 Hash function selection
The fingerprint has to be long enough to make the probability of
two different items having the same fingerprint infinitesimally
small. For practical applications, a 128 or 256-bit fingerprint
would be sufficient. In our experiments, we examine two types of
hashing: cryptographic and non-cryptographic.
Figure 5. Collision probability of a 32-bit hash function
5. IMPLEMENTATION AND EXPERIMENTS
The implementation of our approach consists of three steps:
- Preprocessing
- Filter construction
- On-line checking
5.1 Preprocessing
We first examine the existing file set and compute the file size
distribution. Then we run the dynamic programming algorithm
described in [1] to compute the optimal filter placement strategy,
including filter locations, the number of hash functions, and filter
sizes. There is usually a straightforward mapping between the
expected false positive probability and the total amount of
memory needed for the Bloom filter.
5.2 Filter construction
In this step, we construct the actual Bloom filter based on the
optimal parameters computed in the previous step. In general,
Bloom filters need multiple hash functions. This can be achieved by
appending a different random number to the end of the data and
computing a new hash value for each hash function. We then take
the hash value mod m (the number of bits in the filter) to map it to
a bit index in the Bloom filter. We repeat this process for all files.
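A sketch of this construction step is shown below. The salt values, SHA-1 as the base hash, and the filter size are illustrative assumptions; the salt-appending scheme itself is the one described above.

```python
import hashlib
import os

NUM_HASHES = 7                       # k, chosen in the preprocessing step
M_BITS = 8 * 5_120_000               # filter size in bits (example value)
SALTS = [os.urandom(8) for _ in range(NUM_HASHES)]  # one suffix per "hash function"
filter_bits = bytearray(M_BITS // 8)

def bit_indices(segment: bytes):
    """Simulate k hash functions by appending a distinct salt, then mod m."""
    for salt in SALTS:
        h = int.from_bytes(hashlib.sha1(segment + salt).digest(), "big")
        yield h % M_BITS

def insert(segment: bytes) -> None:
    for p in bit_indices(segment):
        filter_bits[p // 8] |= 1 << (p % 8)
```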
5.3 On-line checking
Once the filters are constructed, the system is ready to launch. At
this step, data traffic is processed and the hash checksums are
continuously computed. Finally, the Bloom filter checking is done
at each pre-determined location. A data flow is allowed to
continue if the Bloom filter gives a hit result. Otherwise the flow
is dropped. To minimize the leakage of small files, we mandate a
check at the end of the data flow, before releasing the last packet.
In this way, small files that are contained in a single packet can
always be checked.
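Putting the pieces together, the on-line check might look like the sketch below. The packet framing, the MD5 running hash, and the `bloom` object (anything supporting `in`, such as the Section 3 sketch) are illustrative assumptions.

```python
import hashlib

def inspect_flow(packets, bloom, checkpoints) -> bool:
    """Return True if the flow may continue, False if it must be dropped.
    The filter is queried at every checkpoint; the final query models the
    mandatory end-of-flow check that covers small, single-packet files."""
    h, pos, pending = hashlib.md5(), 0, sorted(checkpoints)
    for pkt in packets:
        while pending and pos + len(pkt) >= pending[0]:
            cut = pending.pop(0) - pos   # split the packet at the checkpoint
            h.update(pkt[:cut])
            pos += cut
            pkt = pkt[cut:]
            if h.digest() not in bloom:  # fingerprint unknown: not white-listed
                return False
        h.update(pkt)
        pos += len(pkt)
    return h.digest() in bloom           # check before releasing the last packet
```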
5.4 Experiment scenarios
We used 9.3 GB of data to evaluate our algorithm. To collect the
data, we used the wget command on Linux to download files from
the Internet. We crawled the following web sites:
- www.tools.ietf.org
- www.tuoitre.vn
- www.bbc.co.uk
- www.nld.com.vn
These are news and scientific sites. Downloaded files were
classified by file type and file size. To download a whole site
automatically, we used the following command:
wget -x --convert-links -r <address of website>
In the Bloom filter experiments, data files were processed
following the three implementation steps described in the previous
sections. In the last step, to simulate data leakage, we randomly
selected 1,000 files and then, within each file, randomly selected
one byte as the "bad byte" and changed its value. For each file, we
computed the incremental hash value up to each Bloom filter
location and checked whether that fingerprint was present there.
The bad byte was detected when a check failed. The difference
between the detection location and the location of the bad byte is
the number of leaked bytes; this is called the detection lag.
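The simulation loop can be sketched as follows. The `passes_check` helper (a prefix-fingerprint lookup against the filter) and the bit-flip corruption are hypothetical details introduced for illustration; `check_offsets` is assumed to include the end of the file, matching the mandatory final check.

```python
import random

def measure_detection_lag(data: bytes, check_offsets, passes_check):
    """Flip one random byte, then return how many bytes get past the flip
    before a checkpoint catches it (None if every check passes, i.e. the
    corrupted file leaks through a false positive)."""
    i = random.randrange(len(data))
    corrupted = data[:i] + bytes([data[i] ^ 0xFF]) + data[i + 1:]  # the "bad byte"
    for t in sorted(check_offsets):
        if t <= i:
            continue            # prefix unchanged, so this check passes
        if not passes_check(corrupted[:t]):
            return t - i        # detection lag in bytes
    return None
```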
5.5 Experimental results
First, we used five different hash functions to calculate all the
fingerprints for the same HTML file set. In our experiments,
cryptographic hash functions such as SHA1 [10] and MD5 [11]
took more time than non-cryptographic ones such as Murmur3
[12], JenkinsHash [13], FNVHash [14], and CityHash [15]. The
results show that the number of leaked files was around 4 to 5 out
of 1,000 randomly selected files; that is, the false positive
probability is approximately 0.5 percent.
Fig. 6 compares the hash functions' speeds when hashing all of
the data files; the y-axis is the total time needed to create all the
fingerprints. CityHash was the fastest hash function, beating
CRC128 by approximately 200 seconds; CRC128, SHA1, and
MD5 each took around 1,500 seconds. JenkinsHash was as fast as
FNVHash.
These results show that the non-cryptographic hash functions are
faster than the cryptographic ones. Although CRC-128 is also a
non-cryptographic hash function, it is slower than the
cryptographic SHA1 and MD5. Moreover, CRC-128 and
non-cryptographic hash functions such as JenkinsHash, FNVHash,
and CityHash have approximately the same collision probability.
As a result, we can use CityHash, FNVHash, or JenkinsHash
instead of CRC-128 to speed up the system.
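As an illustration of such a swap, a Murmur3-based fingerprint routine could be as small as the sketch below; it assumes the third-party mmh3 Python package and is not the code used in our experiments.

```python
import mmh3  # third-party package: pip install mmh3

def fingerprint(segment: bytes) -> int:
    """128-bit non-cryptographic fingerprint of a file segment."""
    return mmh3.hash128(segment)
```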
Figure 6. Speed of hash functions
Figure 7. Leaked files
Fig. 7 shows the number of leaked files when we checked 1,000
random files; the count varies from 4 to 7. The Bloom filter
approach works much better than the Fingerprint Comparison
Approach (FCA) [1] because it saves memory and CPU
processing. In our experiments, once the fingerprints for all the
files were created, the Bloom filter required just 5.12 MB at an
expected detection lag of 1,000 bytes. In comparison with the
FCA, the Bloom filter needed 50 times less
memory. This is because the FCA has to store each fingerprint in
full instead of just setting a few bits, as the Bloom filter does. The
Bloom filter also does not need to compare whole fingerprints
when it checks an incoming string. As a result, throughput
improves when users download documents from our database.
Fig. 8 shows the average detection lag when different hash
functions are used to create the fingerprints. The detection lags
differ very little, which means that using CityHash can improve
the system's throughput while keeping a comparable leak rate.
Figure 8. Average detection lag
6. CONCLUSION AND DISCUSSION
In this paper, we applied non-cryptographic hash functions to
create fingerprints for documents, which are used to detect and
prevent data leaks. Our experimental results show that the
proposed method improves the system's throughput while keeping
the same level of data leakage as the approach in [1].
In our experiments, we used a single-core CPU with limited RAM
(4 GB) because of resource constraints. In a practical cloud
environment, the results could be far better, since much more
RAM would be available and the system could use it to reduce the
false positive rate. We also expect a significant improvement if
this approach is deployed on a multi-core CPU system.
7. REFERENCES
[1] F. Hao, M. Kodialam, T. V. Lakshman, and K. P. N.
Puttaswamy. "Protecting cloud data using dynamic inline
fingerprint checks." In Proceedings of IEEE INFOCOM 2013, pp.
2877-2885, April 2013.
[2] B. H. Bloom. "Space/time trade-offs in hash coding with
allowable errors." Communications of the ACM 13(7), 422-426,
July 1970.
[3] DataLossDB statistics. http://www.datalossdb.org/statistics
[4] MyDLP, data leak prevention. http://www.mydlp.com
[5] Cisco. "Cisco IronPort data loss prevention."
http://www.ironport.com/kr/technology/ironport dlp overview.html
[6] S. Yoshihama, T. Mishina, and T. Matsumoto. "Web-based
data leakage prevention." In Proceedings of IWSEC, 2010.
[7] WebDLP. http://www.websense.com/content/home.aspx
[8] S. Butakov. "Using Bloom filters in data leak protection
applications." In DART@AI*IA, volume 1109 of CEUR
Workshop Proceedings, pp. 13-24. CEUR-WS.org, 2013.
[9] F. Hao, M. Kodialam, and T. V. Lakshman. "Building high
accuracy Bloom filters using partitioned hashing." In Proceedings
of ACM SIGMETRICS '07, New York, USA, 2007.
[10] SHA-1 algorithm. http://tools.ietf.org/html/rfc3174
[11] MD5 algorithm. http://www.ietf.org/rfc/rfc1321.txt
[12] Murmur3. https://code.google.com/p/smhasher/
[13] Jenkins hash. http://burtleburtle.net/bob/hash/doobs.html
[14] FNVHash algorithm.
https://tools.ietf.org/html/draft-eastlake-fnv-07
[15] CityHash algorithm. https://code.google.com/p/cityhash/