Corporate workers increasingly use their own devices for work purposes, a trend that has come to be called "Bring Your Own Device" (BYOD), and companies are starting to include it in their policies. For this reason, corporate security systems need to be redefined and adapted to these emerging behaviours by the corporate Information Technology (IT) department. This work proposes applying soft-computing techniques to help the Chief Security Officer (CSO) of a company (in charge of the IT department) improve its security policies.
The actions performed by company workers in a BYOD situation are treated as events: an action or set of actions that yields a response. Some of those events might cause non-compliance with corporate policies, and it then becomes necessary to define a set of security rules (action, consequence). Furthermore, processing the extracted knowledge will allow the rules to be adapted.
1. Applying Soft Computing Techniques to Corporate Mobile Security Systems
Master's in Computer Engineering and Networks
Paloma de las Cuevas Delgado
Supervised by:
Dr. Antonio Miguel Mora García
Dr. Juan Julián Merelo Guervós
2. Index
1. Research context.
2. Underlying problem and objectives.
3. Data description and preprocessing.
4. Experimental setup.
5. Experiments and results.
6. Conclusions and scientific contributions.
7. Future Work.
8. Underlying Problem
● Enterprise security applied to employees' connections to the Internet (URL requests).
● How is security usually enforced?
○ Proxy
○ Blacklists and whitelists: lists of URLs whose access is denied (black) or permitted (white)
○ Firewalls
○ Elaboration of Corporate Security Policies
● The aim of this research is to go a step beyond.
9. Objectives
● Objective → to obtain a tool that automatically makes an allowance or denial decision for URLs that are not included in the black/whitelists.
○ This decision would be based on the one made for similar URL accesses (those with similar features).
○ The tool should consider other parameters of the request in addition to the URL string.
10. Followed Schema
1. Data Mining process
a. Parsing
b. Preprocessing
2. Labelling process (requests labelled as ALLOW or DENY)
3. Machine Learning
4. Studying classification accuracies
12. Working Scenario
● Employees requesting access to URLs during a workday (records from an actual Spanish company with around 100 employees).
● We had access to a log file of 100k entries (patterns) spanning two hours (8 to 10 am), in CSV format.
● We were also provided with a set of rules (a specification of the security policies as if-then clauses).
13. Data description
● An Entry (unlabelled), e.g.:
○ http_reply_code: 200
○ http_method: GET
○ duration_miliseconds: 1114
○ content_type: application/octet-stream
○ server_or_cache_address: X.X.X.X
○ time: 08:30:08
○ squid_hierarchy: DEFAULT_PARENT
○ bytes: 106961
○ url: http://www.one.example.com
○ client_address: X.X.X.X
● A Policy and a Rule
"Video streams cannot be played"

rule "policy-1 MP4"
attributes
when
    squid : Squid( dif_MCT == "video", bytes > 1000000,
                   content_type matches "*.application.*",
                   url matches "*.p2p.*" )
then
    PolicyDecisionPoint.deny();
end
14. Data description
● An Entry
○ Has 7 categorical fields and 3 numerical fields.
● A Rule
○ Has a set of conditions and a decision (ALLOW/DENY).
○ Each condition has three parts (a sketch follows below):
■ Data Type (e.g. bytes)
■ Relationship (e.g. <)
■ Value (e.g. 1000000)
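As a minimal sketch (with hypothetical class and method names, not the actual MUSES code), one such condition could be represented and checked in Java like this:

import java.util.Map;

// Hypothetical sketch: one rule condition as (data type, relationship, value).
public class Condition {
    private final String field;    // e.g. "bytes"
    private final String relation; // e.g. "<"
    private final String value;    // e.g. "1000000"

    public Condition(String field, String relation, String value) {
        this.field = field;
        this.relation = relation;
        this.value = value;
    }

    // Checks this condition against one log entry (field name -> raw value).
    public boolean matches(Map<String, String> entry) {
        String actual = entry.get(field);
        if (actual == null) {
            return false;
        }
        switch (relation) {
            case "==":      return actual.equals(value);
            case "<":       return Long.parseLong(actual) < Long.parseLong(value);
            case ">":       return Long.parseLong(actual) > Long.parseLong(value);
            case "matches": return actual.matches(value);
            default:        return false;
        }
    }
}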
15. Tools used during this research
● Drools and Squid syntax for the rules, CSV format for log data.
● Weka, which offers a comprehensive, state-of-the-art set of classifiers.
● Two implementations:
○ Perl → faster for parsing, but slower for the labelling process and for driving Weka.
○ Java → works natively with Weka, is better for automation, and will be embedded in an actual Java project (MUSES).
16. After the parsing process
● A hash with the entries
○ Keys → Entry fields
○ Values → Field values
● A hash with the set of rules
○ Keys → Condition fields, and decision
○ Values → Name of the data type, its desired value, the relationship between them, and allow or deny.

%logdata = (
    entry => {
        http_reply_code         => xxx,
        http_method             => xxx,
        duration_miliseconds    => xxx,
        content_type            => xxx,
        server_or_cache_address => xxx,
        time                    => xxx,
        squid_hierarchy         => xxx,
        bytes                   => xxx,
        url                     => xxx,
        client_address          => xxx,
    },
);

%rules = (
    rule => {
        field    => xxx,
        relation => xxx,
        value    => xxx,
        decision => xxx,   # 'allow' or 'deny'
    },
);
17. Labelling Process
● The two hashes are compared during the labelling process.
● The conditions of each rule are checked against each entry, as sketched below.
● If an entry meets all the conditions of a rule, it is labelled with that rule's decision.
● A key-value pair holding the decision is added to that entry's hash.
● Conflict resolution is needed when:
○ The entry meets the conditions of a rule that allows the request, and
○ it also meets the conditions of a rule that denies it.
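A minimal sketch of this labelling loop, reusing the hypothetical Condition class from the earlier sketch; taking the first matching rule, as done here, is one simple way to resolve the conflict described above:

import java.util.List;
import java.util.Map;

// Hypothetical sketch: an entry receives the decision of the first rule
// whose conditions it fully meets.
public class Labeller {

    // Minimal rule holder assumed for this sketch.
    public static class Rule {
        final List<Condition> conditions;
        final String decision; // "ALLOW" or "DENY"
        Rule(List<Condition> conditions, String decision) {
            this.conditions = conditions;
            this.decision = decision;
        }
    }

    public static void label(List<Map<String, String>> entries, List<Rule> rules) {
        for (Map<String, String> entry : entries) {
            for (Rule rule : rules) {
                boolean allMet = true;
                for (Condition c : rule.conditions) {
                    if (!c.matches(entry)) { allMet = false; break; }
                }
                if (allMet) {
                    // Add the key-value pair with the decision to the entry's hash.
                    entry.put("decision", rule.decision);
                    break; // first matching rule wins: simple conflict resolution
                }
            }
        }
    }
}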
19. Data Summary
● The CSV file, now containing only the patterns that could be labelled (the rest were not covered by the rules), has 57502 entries/patterns:
○ 38972 with an ALLOW label.
○ 18530 with a DENY label.
→ roughly a 2:1 ratio
● It might be necessary to apply data balancing techniques (sketched below):
○ Undersampling: random removal of patterns from the majority class.
○ Oversampling: duplication of each pattern in the minority class.
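As an illustration (a minimal sketch; the method names and generic list types are ours, not the thesis code), the two strategies can be expressed as:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch of the two balancing strategies described above.
public class Balancing {

    // Undersampling: randomly remove majority-class patterns until only
    // targetSize of them remain (typically the minority-class size).
    public static <T> List<T> undersample(List<T> majority, int targetSize, Random rng) {
        List<T> copy = new ArrayList<>(majority);
        Collections.shuffle(copy, rng);
        return new ArrayList<>(copy.subList(0, targetSize));
    }

    // Oversampling: duplicate every minority-class pattern once.
    public static <T> List<T> oversample(List<T> minority) {
        List<T> doubled = new ArrayList<>(minority);
        doubled.addAll(minority);
        return doubled;
    }
}

With the roughly 2:1 ratio above, duplicating each DENY pattern once (18530 → 37060) brings the two classes close to the 38972 ALLOW patterns.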
20. Experimental Setup
● The classifiers are tested, firstly, with a 10-fold cross-validation process (a Weka sketch follows below).
○ The top five classifiers in accuracy are chosen for the following experiments.
○ The Naïve Bayes classifier is also taken as a reference.
● Secondly, the initial (labelled) log file is divided into training and test files.
● These training and test files are created with different ratios, taking the entries either randomly or sequentially.
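As a hedged illustration of this first step with the Weka API (the file name is hypothetical, and Naive Bayes stands in for any of the tested classifiers):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of the 10-fold cross-validation step with Weka.
public class CrossValidation {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; Weka's DataSource also reads CSV files.
        Instances data = new DataSource("labelled_log.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // ALLOW/DENY label

        NaiveBayes classifier = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}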
22. Flow Diagram
1) Initial labelling process.
Experiments with unbalanced and balanced data. From those, divisions are made, taking patterns randomly and sequentially:
● 80% training, 20% testing
● 90% training, 10% testing
2) Removal of duplicated requests.
Experiments with unbalanced data. From those, divisions are made, taking patterns randomly and sequentially:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing
3) Enhancing the creation of training and test files.
Experiments with unbalanced data. From those, divisions are made, patterns randomly taken:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing
4) Filtering the features of the URL.
Experiments with unbalanced and balanced data. From those, divisions are made, patterns randomly taken:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing
23. First set of experiments
1) Initial labelling process.
● The classifiers are tested, firstly, with a 10-fold cross-validation process
over the balanced data.
24. First set of experiments
1) Initial labelling process.
● Naïve Bayes and the top five classifiers are tested with training and test divisions, so that test patterns are not also used for training and vice versa.
25. First set of experiments
1) Initial labelling process.
Divisions made over unbalanced data
26. First set of experiments
1) Initial labelling process.
Divisions made over balanced data (undersampling)
27. First set of experiments
1) Initial labelling process.
Divisions made over balanced data (oversampling)
28. Second set of experiments
2) Removal of duplicated requests.
● We studied the field squid_hierarchy and saw that it had two possible values: DIRECT or DEFAULT_PARENT.
● Connections are made, firstly, to the Squid proxy and then, if appropriate, the request continues to another server.
○ Thus, some of the entries were repeated, and the results may have been affected by that (see the removal sketch below).
● Example: the entry shown in the data description, with squid_hierarchy = DEFAULT_PARENT, is one such forwarded request.
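A possible sketch of the removal step; the choice of key fields (time, client_address, url) is our assumption, since the slide does not specify how duplicates were detected:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: the same access can be logged once against the proxy and once
// against the parent server, so entries are deduplicated on a key that
// ignores squid_hierarchy.
public class Dedup {
    public static List<Map<String, String>> dedup(List<Map<String, String>> entries) {
        Map<String, Map<String, String>> unique = new LinkedHashMap<>();
        for (Map<String, String> e : entries) {
            String key = e.get("time") + "|" + e.get("client_address") + "|" + e.get("url");
            unique.putIfAbsent(key, e); // keep the first occurrence
        }
        return new ArrayList<>(unique.values());
    }
}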
29. Second set of experiments
2) Removal of duplicated requests.
Divisions made over unbalanced data
30. Third set of experiments
3) Enhancing the creation of training and test files.
● Repeated URL core domains could yield false results.
● During the division process, we ensured that requests with the same URL core domain went to the same file (either training or test), as sketched below.
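A minimal sketch of such a group-aware split; the names and the simplified core-domain extraction are ours, not the thesis code:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch: all requests sharing a URL core domain end up in the same file.
public class GroupedSplit {

    public static void split(List<Map<String, String>> entries, double trainRatio,
                             List<Map<String, String>> train,
                             List<Map<String, String>> test, Random rng) {
        // Group entries by URL core domain.
        Map<String, List<Map<String, String>>> byDomain = new HashMap<>();
        for (Map<String, String> e : entries) {
            byDomain.computeIfAbsent(coreDomain(e.get("url")), k -> new ArrayList<>()).add(e);
        }
        // Assign whole groups, in random order, to training until it is full.
        List<List<Map<String, String>>> groups = new ArrayList<>(byDomain.values());
        Collections.shuffle(groups, rng);
        int inTrain = 0;
        for (List<Map<String, String>> group : groups) {
            if (inTrain < trainRatio * entries.size()) {
                train.addAll(group);
                inTrain += group.size();
            } else {
                test.addAll(group);
            }
        }
    }

    // Naive core-domain extraction, e.g. "http://www.one.example.com" -> "example".
    static String coreDomain(String url) {
        String host = url.replaceFirst("^https?://", "").split("/")[0];
        String[] parts = host.split("\\.");
        return parts.length >= 2 ? parts[parts.length - 2] : host;
    }
}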
31. Third set of experiments
3) Enhancing the creation of training and test files.
32. Created Rules During Classification
● In the experiments where the URL was represented only by its core domain, the rules were too focused on that feature.
PART decision list
------------------
url = dropbox: deny (2999.0)
url = ubuntu: allow (2165.0)
url = facebook: deny (1808.0)
url = valli: allow (1679.0)
33. Created Rules During Classification
● Other kinds of rules were also found, but they were always dependent on the URL core domain.
url = grooveshark AND
http_method = POST: allow (733.0)
url = googleapis AND
content_type = text/javascript AND
client_address = 192.168.4.4: allow (155.0/2.0)
url = abc AND
content_type_MCT = image AND
time <= 31532000: allow (256.0)
34. Fourth set of experiments
4) Filtering the features of the URL.
● The rules created by the classifiers are too focused on the URL core domain feature.
● We repeated the experiments with the original file, but including as a feature only the Top Level Domain (TLD) of the URL, not the core domain (see the sketch below).
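As an illustration of this filtering step (a naive extraction, assuming well-formed http/https URLs; the class and method names are ours):

// Sketch: replacing the URL core domain feature with only its Top Level
// Domain, e.g. "http://www.one.example.com" -> "com".
public class TldFeature {
    static String topLevelDomain(String url) {
        String host = url.replaceFirst("^https?://", "").split("/")[0];
        return host.substring(host.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        System.out.println(topLevelDomain("http://www.one.example.com")); // com
    }
}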
35. Fourth set of experiments
4) Filtering the features of the URL.
Divisions made over unbalanced data
36. Fourth set of experiments
4) Filtering the features of the URL.
Divisions made over balanced data
37. Created Rules During Classification
● After replacing the URL core domain with the URL top level domain as a classification feature, the rules classify mainly by the server address.
PART decision list
------------------
server_or_cache_address = 173.194.34.248: allow (238.0/1.0)
server_or_cache_address = 8.27.153.126: allow (235.0/2.0)
server_or_cache_address = 91.121.155.13: deny (235.0)
server_or_cache_address = 66.220.152.19: deny (201.0)
38. Created Rules During Classification
● The URL TLD appears, but now the rules are not always dependent on this feature.
server_or_cache_address = 90.84.53.48 AND
client_address = 10.159.39.199 AND
tld = es AND
time <= 31533000: allow (138.0/1.0)
content_type = application/octet-stream AND
tld = com AND
server_or_cache_address = 192.168.4.4 AND
client_address = 10.159.86.22: allow (210.0)
server_or_cache_address = 90.84.53.19 AND
tld = com: deny (33.0/1.0)
40. Conclusions
● In most cases, the Random Forest classifier is the one that yields the best results.
● The loss of information when analysing a log of URL requests lowers the results. This happens when:
○ Undersampling the data (because we randomly remove patterns).
○ Keeping the sequence of requests of the initial log file while making the division into training and test files.
41. Conclusions
● For future experiments, it should be ensured that the same URL lexical features (like the core domain) are not present in both the training and test files at the same time.
○ Otherwise the results are skewed.
● As seen in the rules obtained, it is possible to develop a tool that automatically makes an allowance or denial decision with respect to URLs, with that decision depending on other features of the request and not only the URL.
42. Scientific Contributions
● MUSES: A corporate user-centric system which applies computational intelligence methods, at the ACM SAC conference, Gyeongju, Korea, March 2014.
● Enforcing Corporate Security Policies via Computational Intelligence Techniques, at the SecDef Workshop at GECCO, Vancouver, July 2014.
● Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers, at ECTA, Rome, Italy, October 2014.
44. Future Work
● Run experiments with bigger data sets (e.g. a whole workday).
● Include more lexical features of a URL in the experiments (e.g. number of subdomains, number of arguments, or the path).
● Consider sessions when classifying.
○ A session is defined as the set of requests made by a certain client during a certain time window.
● Finally, implement the system and test it with real data, in real time.
45. Thank you for your attention
Questions?
paloma@geneura.ugr.es
Twitter @unintendedbear