Corporate systems can be secured using many different methods, among them the implementation of black or white lists.
With these lists it is possible to restrict (or allow) the execution of applications by users, or their access to certain URLs, among other things. This paper is focused on the latter option. It describes the whole processing of a data set composed of URL sessions performed by the employees of a company: from the preprocessing stage, including the labelling and data balancing processes, to the application of several classification algorithms. The aim is to define a method for automatically deciding whether to allow or deny future URL requests, considering a set of corporate security policies.
Thus, this work goes a step beyond the usual black and white lists, since those can only control the URLs that are specifically included in them, whereas the approach proposed here makes decisions based on similarity (through classification techniques) and on other variables of the session.
The results show a set of classification methods which reach very good classification rates (95-97%) and which infer useful rules based on additional features of the user's access (rather than just the URL string). This leads us to consider that this kind of tool would be very useful for an enterprise.
Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers
1. Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers
Authors:
Antonio Miguel Mora García
Paloma de las Cuevas Delgado
Juan Julián Merelo Guervós
ECTA 2014, Rome, Italy
5. Underlying Problem
Enterprise security applied to employees' connections to the Internet (URL requests).
● Proxies
● Firewalls
● Corporate Security Policies (CSP), which may include blacklists and whitelists
6. What do Black and White lists cover?
● Every URL inside a blacklist is denied; if not listed, it is allowed.
What if something is allowed but should not be?
● Every URL inside a whitelist is allowed; if not listed, it is denied.
What if something is denied but should not be?
Therefore, we want to go a step beyond.
7. Objectives
● Objective → to obtain a tool that automatically makes an allowance or denial decision for URLs that are not included in the black/white lists.
o This decision would be based on the one made for similar URL accesses (those with similar features).
o The tool should consider other parameters of the request in addition to the URL string.
8. Followed Schema
[Flow diagram: unlabelled requests → Data Mining (labelling process) → labelled requests → Machine Learning (classification methods) → classification accuracies and rules → analysis of results]
9. Working Scenario
Employees requesting access to URLs (records from an actual Spanish company with around 100 employees), from 8 to 10 am.
● Log file of 100k entries (patterns), in CSV format.
● A set of rules (the security policies specified as if-then clauses).
10. Data description: Entries in the Log
● An entry (unlabelled) has 7 categorical fields and 3 numerical fields.
● This calls for classifiers that support both types:
o Rule-based classifiers
o Tree-based classifiers
Example entry:
http_reply_code: 200 | http_method: GET | duration_milliseconds: 1114 |
content_type: application/octet-stream | server_or_cache_address: X.X.X.X |
time: 08:30:08 | squid_hierarchy: DEFAULT_PARENT | bytes: 106961 |
url: http://www.one.example.com | client_address: X.X.X.X
11. Data description: Policies and Rules
● A policy and its corresponding rule:
“Video streams cannot be played”
rule "policy-1 MP4"
attributes
when
    squid : Squid( dif_MCT == "video", bytes > 1000000,
                   content_type matches "*.application.*",
                   url matches "*.p2p.*" )
then
    PolicyDecisionPoint.deny();
end
● A rule has a set of conditions and a decision (ALLOW/DENY).
● Each condition has: a data type, a relationship, and a value.
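The (data type, relationship, value) structure of a condition can be sketched in code. This is a minimal, hypothetical illustration (not the authors' implementation): field names like `dif_MCT` come from the slide, and glob-style patterns stand in for the `matches` operator of the rule language shown above.

```python
import fnmatch

# Hypothetical encoding of the "policy-1 MP4" rule: a list of
# (field, relation, value) conditions plus a decision.
RULE_MP4 = {
    "conditions": [
        ("dif_MCT", "==", "video"),
        ("bytes", ">", 1_000_000),
        ("content_type", "matches", "*application*"),
        ("url", "matches", "*p2p*"),
    ],
    "decision": "DENY",
}

def condition_holds(entry, field, relation, value):
    """Check a single (field, relation, value) condition on a log entry."""
    actual = entry.get(field)
    if relation == "==":
        return actual == value
    if relation == ">":
        return actual > value
    if relation == "matches":   # glob pattern here, simplifying the rule syntax
        return fnmatch.fnmatch(str(actual), value)
    raise ValueError(f"unknown relation: {relation}")

def rule_matches(entry, rule):
    """A rule fires only if every one of its conditions holds."""
    return all(condition_holds(entry, *c) for c in rule["conditions"])

entry = {"dif_MCT": "video", "bytes": 2_500_000,
         "content_type": "application/mp4", "url": "http://p2p.example.com"}
print(rule_matches(entry, RULE_MP4))  # → True
```

Any entry failing even one condition (e.g. a small image request) leaves the rule silent, so that entry's label must come from some other rule.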
12. Labelling Process
● The two data sets are compared during the labelling process.
● The conditions of each rule are checked against each entry/request.
● If an entry meets all the conditions of a rule, it is labelled with the corresponding decision of that rule.
● When an entry meets the conditions of both a rule that allows the request AND a rule that denies it, DENY is chosen.
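The labelling step, with DENY taking precedence when rules conflict, can be sketched as follows. This is an illustrative toy version, not the paper's code: the rule format and the `matches` predicate are stand-ins.

```python
# Minimal sketch of the labelling step: an entry is checked against all
# rules; if both an ALLOW rule and a DENY rule fire, DENY wins.
def label_entry(entry, rules, matches):
    decisions = {r["decision"] for r in rules if matches(entry, r)}
    if "DENY" in decisions:      # deny takes precedence over allow
        return "DENY"
    if "ALLOW" in decisions:
        return "ALLOW"
    return None                  # not covered by any rule -> left unlabelled

# Toy single-condition rules (hypothetical fields and values)
rules = [
    {"decision": "ALLOW", "field": "http_method", "value": "GET"},
    {"decision": "DENY",  "field": "content_type", "value": "video"},
]
match = lambda e, r: e[r["field"]] == r["value"]

# Both rules fire on this entry, so the DENY label is chosen.
print(label_entry({"http_method": "GET", "content_type": "video"},
                  rules, match))  # → DENY
```

Entries returning `None` here correspond to the patterns that were dropped from the data set because no rule covered them.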
13. Data Summary
● The CSV file, now containing all the patterns that could be labelled (the others were not covered by the rules), has 57502 entries/patterns:
o 38972 with an ALLOW label.
o 18530 with a DENY label (roughly a 2:1 ratio).
● Application of data balancing techniques:
o Undersampling: random removal of patterns in the majority class.
o Oversampling: duplication of each pattern in the minority class.
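The two balancing strategies can be sketched in a few lines. This is a generic illustration on toy data (the real classes are 38972 ALLOW vs 18530 DENY entries), not the exact procedure used in the experiments.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly drop majority-class patterns down to the minority size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def oversample(majority, minority):
    """Duplicate minority-class patterns up to (about) the majority size."""
    factor = len(majority) // len(minority)
    return majority + minority * factor

allow = [("allow", i) for i in range(6)]
deny  = [("deny", i) for i in range(3)]   # roughly the 2:1 ratio in the data
print(len(undersample(allow, deny)))  # → 6
print(len(oversample(allow, deny)))   # → 12
```

Note the trade-off the slides later return to: undersampling discards information, while oversampling only repeats patterns that are already there.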
14. Experimental Setup
● The classifiers are tested, firstly, with a 10-fold cross-validation process.
o The top five classifiers in accuracy are chosen for the following experiments.
o Also, the Naïve Bayes classifier is taken as a reference.
● Secondly, the initial (labelled) log file is divided into training and test files.
● These training and test files are created with different ratios, taking the entries either randomly or sequentially.
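The two division modes (random vs. sequential) can be sketched as one small function. This is an illustrative sketch of the idea, not the tooling actually used for the experiments.

```python
import random

def split(entries, train_ratio, randomly=True, seed=0):
    """Divide entries into (training, test) files.

    randomly=True  -> entries are shuffled before cutting.
    randomly=False -> the original log sequence is kept: the first part
                      becomes the training file, the rest the test file.
    """
    entries = list(entries)
    if randomly:
        random.Random(seed).shuffle(entries)
    cut = int(len(entries) * train_ratio)
    return entries[:cut], entries[cut:]

train, test_set = split(range(100), 0.8, randomly=False)
print(len(train), len(test_set))  # → 80 20
```

The 80/20, 90/10 and 60/40 ratios from the flow diagram correspond to `train_ratio` values of 0.8, 0.9 and 0.6.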
15. Flow Diagram
1) Initial labelling process.
Experiments with unbalanced and balanced data. From those, divisions are made, randomly and sequentially:
● 80% training / 20% testing
● 90% training / 10% testing
2) Removal of duplicated requests.
Experiments with unbalanced data. From those, divisions are made, randomly and sequentially:
● 80% training / 20% testing
● 90% training / 10% testing
● 60% training / 40% testing
3) Enhancing the creation of training and test files.
Experiments with unbalanced data. From those, divisions are made, patterns randomly taken:
● 80% training / 20% testing
● 90% training / 10% testing
● 60% training / 40% testing
4) Filtering the features of the URL.
Experiments with unbalanced and balanced data. From those, divisions are made, patterns randomly taken:
● 80% training / 20% testing
● 90% training / 10% testing
● 60% training / 40% testing
16. 10-fold cross-validation experiments
1) Initial labelling process.
● The classifiers are tested, firstly, with a 10-fold cross-validation process
over the balanced data.
17. Using separate training/test files
1) Initial labelling process.
● Naïve Bayes and the top five classifiers are tested with training and test divisions, in order to avoid test patterns being used for training and vice versa.
18. Serendipity rocks
1) Initial labelling process.
Divisions made over unbalanced data
19. Results continue falling
1) Initial labelling process.
Divisions made over balanced data (undersampling)
20. Results continue falling
1) Initial labelling process.
Divisions made over balanced data (oversampling)
21. Why are accuracies still high?
2) Removal of duplicated requests.
● We studied the squid_hierarchy field and saw that it had two possible values: DIRECT or DEFAULT_PARENT.
[Example entry from slide 10, with squid_hierarchy = DEFAULT_PARENT]
22. Repeated entries affect accuracies
2) Removal of duplicated requests.
● The connections are made, firstly, to the Squid proxy and then, if appropriate, the request continues to another server.
o Thus, some of the entries were repeated, and the results may be affected by that.
[Diagram: “Some local IP” → 192.194.2.2 → “Some server IP”]
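The duplicate-removal step can be sketched as follows. The choice of key fields here is an assumption for illustration; the slides only say that some entries appeared twice because a request passes through the Squid proxy before reaching the final server.

```python
# Sketch of duplicate removal: entries that agree on a chosen key
# (here time + client + url, an assumed key) are collapsed to one,
# keeping the first occurrence.
def remove_duplicates(entries, key_fields=("time", "client_address", "url")):
    seen, unique = set(), []
    for e in entries:
        key = tuple(e[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

# The same request logged twice: once through the proxy hierarchy,
# once going direct.
log = [
    {"time": "08:30:08", "client_address": "X.X.X.X",
     "url": "http://www.one.example.com", "squid_hierarchy": "DEFAULT_PARENT"},
    {"time": "08:30:08", "client_address": "X.X.X.X",
     "url": "http://www.one.example.com", "squid_hierarchy": "DIRECT"},
]
print(len(remove_duplicates(log)))  # → 1
```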
23. Serendipity rocks again
2) Removal of duplicated requests.
Divisions made over unbalanced data
24. Where are the URL features going?
3) Enhancing the creation of training and test files.
● Repeated URL core domains could yield misleading results.
● During the division process, we ensured that requests with the same URL core domain went to the same file (either training or testing).
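The domain-grouped division described above can be sketched like this. The core-domain extraction here is deliberately simplified (second-to-last host label), which is an assumption; the real extraction in the paper may differ.

```python
import random
from urllib.parse import urlparse

def core_domain(url):
    """Simplified core-domain extraction:
    http://www.one.example.com -> 'example'."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else host

def grouped_split(entries, train_ratio=0.8, seed=0):
    """Split so that all requests sharing a core domain land in the
    same file, so no domain appears in both training and testing."""
    domains = sorted({core_domain(e["url"]) for e in entries})
    random.Random(seed).shuffle(domains)
    cut = int(len(domains) * train_ratio)
    train_domains = set(domains[:cut])
    train = [e for e in entries if core_domain(e["url"]) in train_domains]
    test  = [e for e in entries if core_domain(e["url"]) not in train_domains]
    return train, test
```

Splitting by domain rather than by entry is what removes the optimistic bias: with a plain random split, two requests to the same domain can end up on opposite sides of the division.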
25. Accuracies fall down automatically
3) Enhancing the creation of training and test files.
26. Created Rules During Classification
● In the experiments that included only the URL core domain as a
classification feature, rules were too focused on that feature.
PART decision list
------------------
url = dropbox: deny (2999.0)
url = ubuntu: allow (2165.0)
url = facebook: deny (1808.0)
url = valli: allow (1679.0)
27. Created Rules During Classification
● Other kinds of rules were found, but they always depended on the URL core domain.
url = grooveshark AND
http_method = POST: allow (733.0)
url = googleapis AND
content_type = text/javascript AND
client_address = 192.168.4.4: allow (155.0/2.0)
url = abc AND
content_type_MCT = image AND
time <= 31532000: allow (256.0)
28. Training with other URL features
4) Filtering the features of the URL.
● Rules created by the classifiers are too focused on the URL core domain
feature.
● We did the experiments again with the original file, but including as a
feature only the Top Level Domain of the URL, and not the core domain.
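Replacing the core domain with the top-level domain as a feature can be sketched in a couple of lines. This is an illustrative helper, not the paper's code.

```python
from urllib.parse import urlparse

def url_tld(url):
    """Keep only the top-level domain of a URL as a feature,
    dropping the core domain entirely."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""

print(url_tld("http://www.one.example.com/path"))  # → com
print(url_tld("https://example.es"))               # → es
```

The `tld = es` condition in the rules on the next slides corresponds to exactly this kind of feature.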
29. Random Forest defeats everyone
4) Filtering the features of the URL.
Divisions made over balanced data
30. Created Rules During Classification
● After including the URL top-level domain as a classification feature, instead of the URL core domain, the rules classify mainly by server address.
PART decision list
------------------
server_or_cache_address = 173.194.34.248: allow (238.0/1.0)
server_or_cache_address = 91.121.155.13: deny (235.0)
server_or_cache_address = 90.84.53.48 AND
client_address = 10.159.39.199 AND
tld = es AND
time <= 31533000: allow (138.0/1.0)
31. Created Rules During Classification
● The URL TLD appears, but now the rules are not always dependent on this feature.
server_or_cache_address = 90.84.53.19 AND
tld = com: deny (33.0/1.0)
server_or_cache_address = 87.248.20.254 AND
content_type_MCT = image AND
duration_milliseconds > 21: deny (15.0)
server_or_cache_address = 23.38.17.224 AND
time > 30532000 AND
http_reply_code = 200 AND
content_type_MCT = image AND
bytes <= 520 AND
time <= 33677000: allow (40.0)
32. Conclusions
● In most cases, the Random Forest classifier is the one that yields the best results.
● The loss of information when analysing a log of URL requests lowers the results. This happens when:
o Undersampling the data (because patterns are randomly removed).
o Keeping the sequence of the requests of the initial log file while making the division into training and test files.
33. Conclusions
● As seen in the rules obtained, it is possible to develop a tool that automatically makes an allowance or denial decision for a URL request, with that decision depending not only on the URL itself but also on other features of the request.
34. Future Work
● Making experiments with bigger data sets (e.g. a whole workday).
● Including more lexical features of a URL in the experiments (e.g. number of subdomains, number of arguments, or the path).
● Considering sessions when classifying:
o Defining a session as the set of requests made from a certain client during a certain period of time.
● Finally, implementing the system and testing it with real data, in real time.
35. Thank you for your attention
Questions?
amorag@geneura.ugr.es
jmerelo@geneura.ugr.es
paloma@geneura.ugr.es
Twitter: @amoragar, @jjmerelo, @unintendedbear