Applying Soft Computing 
Techniques to Corporate Mobile 
Security Systems 
Máster en Ingeniería de Computadores y 
Redes 
Paloma de las Cuevas Delgado 
Supervised by:
Antonio Miguel Mora García 
Juan Julián Merelo Guervós
1. Research context. 
2. Underlying problem and objectives. 
3. Data description and preprocessing. 
4. Experimental setup. 
5. Experiments and results. 
6. Conclusions and scientific contributions. 
7. Future Work. 
Index
Research Context 
Research Context 
● Bring Your Own Device problem 
Research Context 
● MUSES SERVER 
1. Research context. 
2. Underlying problem and objectives. 
3. Data description and preprocessing. 
4. Experimental setup. 
5. Experiments and results. 
6. Conclusions and scientific contributions. 
7. Future Work. 
Index
Underlying Problem 
● Enterprise Security applied to employees’ connections to the 
Internet (URL requests). 
● Security? How? 
○ Proxy 
○ Blacklists 
○ Whitelists 
○ Firewalls 
○ Elaboration of Corporate Security Policies 
Lists of URLs whose access is permitted (whitelist) or blocked (blacklist)
● The aim of this research is to go a step beyond these static lists.
Objectives
● Objective → to obtain a tool that automatically makes an allow or deny decision for URLs that are not included in the black/whitelists.
○ This decision would be based on the decisions made for similar URL accesses (those with similar features).
○ The tool should consider other parameters of the request in addition to the URL string.
Followed Schema
1. Data Mining process
a. Parsing
b. Preprocessing
2. Labelling process (requests labelled as ALLOW or DENY)
3. Machine Learning
4. Studying classification accuracies
1. Research context. 
2. Underlying problem and objectives. 
3. Data description and preprocessing. 
4. Experimental setup. 
5. Experiments and results. 
6. Conclusions and scientific contributions. 
7. Future Work. 
Index
Working Scenario
● Employees request access to URLs during the workday (records from an actual Spanish company with around 100 employees).
● We had access to a log file of 100k entries (patterns) covering two hours (8-10 am), in CSV format.
● We were also provided with a set of rules (the corporate security policies specified as if-then clauses).
Data description
● An Entry (unlabelled):
http_reply_code = 200
http_method = GET
duration_miliseconds = 1114
content_type = application/octet-stream
server_or_cache_address = X.X.X.X
time = 08:30:08
squid_hierarchy = DEFAULT_PARENT
bytes = 106961
url = http://www.one.example.com
client_address = X.X.X.X
● A Policy and a Rule:
“Video streams cannot be played”
rule "policy-1 MP4"
attributes
when
    squid:Squid(dif_MCT == "video", bytes > 1000000,
                content_type matches "*.application.*",
                url matches "*.p2p.*")
then
    PolicyDecisionPoint.deny();
end
Data description
● An Entry
○ Has 7 categorical fields and 3 numerical fields.
● A Rule
○ Has a set of conditions, and a decision (ALLOW/DENY).
○ Each condition has three parts (see the sketch below):
■ Data Type (e.g. bytes)
■ Relationship (e.g. <)
■ Value (e.g. 1000000)
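As an illustration of this structure, a minimal Java sketch of how a rule condition could be represented and checked against an entry; the type and method names are hypothetical, not the actual MUSES code.

import java.util.Map;

// One rule condition: a data type (field), a relationship, and a value.
record Condition(String field, String relation, String value) {

    // Checks the condition against one log entry, given as a field -> value map.
    // Numeric comparisons assume the field holds a number (e.g. bytes).
    boolean matches(Map<String, String> entry) {
        String actual = entry.get(field);
        if (actual == null) return false;
        switch (relation) {
            case "==":      return actual.equals(value);
            case "<":       return Double.parseDouble(actual) < Double.parseDouble(value);
            case ">":       return Double.parseDouble(actual) > Double.parseDouble(value);
            case "matches": return actual.matches(value);   // value interpreted as a regular expression
            default:        return false;
        }
    }
}

For example, the rule shown earlier would contain the condition new Condition("bytes", ">", "1000000").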
Tools used during this research 
● Drools and Squid syntax for the rules, CSV format for the log data.
● Weka, which provides a comprehensive, state-of-the-art set of classifiers.
● Two implementations:
○ Perl → faster at parsing, but slower in the labelling process and when driving Weka.
○ Java → native integration with Weka, better for automation; it will be embedded in an actual Java project (MUSES).
After the parsing process
● A hash with the entries
○ Keys → Entry fields
○ Values → Field values
● A hash with the set of rules
○ Keys → Condition fields, and decision
○ Values → the name of the data type, its desired value, the relationship between them, and the decision (allow or deny).
(A Java counterpart of this parsing step is sketched below.)

%logdata = (
    entry => {
        http_reply_code         => 'xxx',
        http_method             => 'xxx',
        duration_miliseconds    => 'xxx',
        content_type            => 'xxx',
        server_or_cache_address => 'xxx',
        time                    => 'xxx',
        squid_hierarchy         => 'xxx',
        bytes                   => 'xxx',
        url                     => 'xxx',
        client_address          => 'xxx',
    },
);

%rules = (
    rule => {
        field    => 'xxx',
        relation => 'xxx',
        value    => 'xxx',
        decision => 'xxx',   # 'allow' or 'deny'
    },
);
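For illustration, a Java counterpart of this parsing step could look as follows; the field order is assumed from the sample entry shown earlier, and this is a sketch rather than the actual parser.

import java.util.LinkedHashMap;
import java.util.Map;

class LogParser {
    // Log fields in the order they appear in the CSV file (assumed).
    static final String[] FIELDS = {
        "http_reply_code", "http_method", "duration_miliseconds", "content_type",
        "server_or_cache_address", "time", "squid_hierarchy", "bytes", "url", "client_address"
    };

    // Turns one CSV line into a field -> value map, mirroring the Perl %logdata entry.
    static Map<String, String> parseLine(String line) {
        String[] values = line.split(",", -1);   // -1 keeps empty trailing fields
        Map<String, String> entry = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length && i < values.length; i++) {
            entry.put(FIELDS[i], values[i].trim());
        }
        return entry;
    }
}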
Labelling Process
● The two hashes are compared during the labelling process.
● The conditions of each rule are checked against each entry.
● If an entry meets all the conditions of a rule, it is labelled with the corresponding decision of that rule.
● A key-value pair with the decision is added to the hash of that entry.
● Conflict resolution is needed when:
○ The entry meets the conditions of a rule that allows the request, and
○ The entry also meets the conditions of a rule that denies it.
(A sketch of the labelling loop follows below.)
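A minimal sketch of this loop in Java, reusing the Condition type from the earlier sketch. The conflict-resolution choice shown here (a denying rule overrides an allowing one) is an assumption; the slides do not state how the conflict is actually resolved.

import java.util.List;
import java.util.Map;

// A rule: a set of conditions plus a decision ("allow" or "deny").
record Rule(List<Condition> conditions, String decision) { }

class Labeller {
    // Returns "allow", "deny", or null when no rule covers the entry (it stays unlabelled).
    static String label(Map<String, String> entry, List<Rule> rules) {
        String decision = null;
        for (Rule rule : rules) {
            // The entry is covered only if it meets all conditions of the rule.
            boolean meetsAll = rule.conditions().stream().allMatch(c -> c.matches(entry));
            if (!meetsAll) continue;
            // Assumed conflict resolution: DENY takes precedence over ALLOW.
            if ("deny".equals(rule.decision())) return "deny";
            decision = "allow";
        }
        return decision;
    }
}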
1. Research context. 
2. Underlying problem and objectives. 
3. Data description and preprocessing. 
4. Experimental setup. 
5. Experiments and results. 
6. Conclusions and scientific contributions. 
7. Future Work. 
Index
Data Summary 
● After labelling, the CSV file keeps the 57502 entries/patterns that could be labelled (the rest were not covered by any rule):
○ 38972 with an ALLOW label.
○ 18530 with a DENY label.
Roughly a 2:1 ratio
● Data balancing techniques may therefore be needed (both are sketched below):
○ Undersampling: random removal of patterns in the majority class.
○ Oversampling: duplication of each pattern in the minority class.
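As an illustration, both balancing techniques can be sketched as plain list operations in Java; the row representation and the exact procedure are assumptions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class Balancing {
    // Undersampling: randomly discard majority-class rows until both classes have the same size.
    static List<String[]> undersample(List<String[]> majority, List<String[]> minority, Random rng) {
        List<String[]> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, rng);
        List<String[]> balanced = new ArrayList<>(shuffled.subList(0, minority.size()));
        balanced.addAll(minority);
        return balanced;
    }

    // Oversampling: duplicate every minority-class row once, which roughly evens out a 2:1 ratio.
    static List<String[]> oversample(List<String[]> majority, List<String[]> minority) {
        List<String[]> balanced = new ArrayList<>(majority);
        balanced.addAll(minority);
        balanced.addAll(minority);   // each minority pattern now appears twice
        return balanced;
    }
}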
Experimental Setup 
● The classifiers are tested, firstly, with a 10-fold cross-validation process (a Weka sketch follows below).
○ The top five classifiers by accuracy are chosen for the following experiments.
○ The Naïve Bayes classifier is also kept as a reference.
● Secondly, the initial (labelled) log file is divided into training and test files.
● These training and test files are created with different ratios, taking the entries either randomly or sequentially.
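As an illustration of the cross-validation step, a minimal sketch using Weka's Java API; the file name is hypothetical, and the labelled CSV log would first be converted to ARFF (or loaded with Weka's CSVLoader).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labelled_log.arff");   // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);             // ALLOW/DENY label as last attribute
        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation of one candidate classifier.
        eval.crossValidateModel(new RandomForest(), data, 10, new Random(1));
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
    }
}

The same evaluation would be repeated for each candidate classifier in order to pick the top five.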
1. Research context. 
2. Underlying problem and objectives. 
3. Data description and preprocessing. 
4. Experimental setup. 
5. Experiments and results. 
6. Conclusions and scientific contributions. 
7. Future Work. 
Index
Flow Diagram
1) Initial labelling process.
Experiments with unbalanced and balanced data. From those, divisions are made, randomly and sequentially:
● 80% training, 20% testing
● 90% training, 10% testing
2) Removal of duplicated requests.
Experiments with unbalanced data. From those, divisions are made, randomly and sequentially:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing
3) Enhancing the creation of training and test files.
Experiments with unbalanced data. From those, divisions are made, patterns randomly taken:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing
4) Filtering the features of the URL.
Experiments with unbalanced and balanced data. From those, divisions are made, patterns randomly taken:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing
(A sketch of the random/sequential division step follows below.)
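The division step can be sketched in plain Java as follows (illustrative only, not the actual scripts): a random division shuffles the patterns first, while a sequential one keeps the original log order.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class Split {
    // Splits rows into {training, test} at the given ratio (e.g. 0.8 for an 80/20 division).
    static List<List<String[]>> divide(List<String[]> rows, double trainRatio, boolean random) {
        List<String[]> copy = new ArrayList<>(rows);
        if (random) {
            Collections.shuffle(copy, new Random(1));   // random division
        }                                               // otherwise: sequential (original log order)
        int cut = (int) Math.round(copy.size() * trainRatio);
        return List.of(new ArrayList<>(copy.subList(0, cut)),
                       new ArrayList<>(copy.subList(cut, copy.size())));
    }
}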
First set of experiments 
1) Initial labelling process. 
● The classifiers are tested, firstly, with a 10-fold cross-validation process 
over the balanced data. 
First set of experiments 
1) Initial labelling process. 
● Naïve Bayes and the top five classifiers are tested with training/test divisions, so that test patterns are never used for training and vice versa.
First set of experiments 
1) Initial labelling process. 
Divisions made over unbalanced data 
First set of experiments 
1) Initial labelling process. 
Divisions made over balanced data (undersampling) 
First set of experiments 
1) Initial labelling process. 
Divisions made over balanced data (oversampling) 
Second set of experiments
2) Removal of duplicated requests.
● We studied the field squid_hierarchy and saw that it had two possible values: DIRECT or DEFAULT_PARENT.
● Connections go first to the Squid proxy and then, if appropriate, the request continues to another server.
○ As a result, some entries were repeated, which may have affected the results (a de-duplication sketch follows below).
(The sample entry from the data description is shown again here, with squid_hierarchy = DEFAULT_PARENT.)
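One possible de-duplication step, sketched in Java; which fields identify the "same request" is an assumption (here client, time, and URL).

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Deduplication {
    // Keeps only the first occurrence of each request.
    static List<Map<String, String>> removeDuplicates(List<Map<String, String>> entries) {
        Map<String, Map<String, String>> unique = new LinkedHashMap<>();
        for (Map<String, String> e : entries) {
            String key = e.get("client_address") + "|" + e.get("time") + "|" + e.get("url");
            unique.putIfAbsent(key, e);   // later duplicates (e.g. the DEFAULT_PARENT hop) are dropped
        }
        return List.copyOf(unique.values());
    }
}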
Second set of experiments 
2) Removal of duplicated requests. 
Divisions made over unbalanced data 
Third set of experiments 
3) Enhancing the creation of training and test files. 
● Repeated URL core domains could yield misleading results.
● During the division process, we ensured that requests with the same URL core domain went to the same file (either training or testing); one way to implement this is sketched below.
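A sketch in Java of such a domain-aware division: requests are grouped by core domain and whole groups are assigned to one file. The core-domain extraction and the ratio handling here are simplified assumptions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class DomainAwareSplit {
    // Assigns whole core-domain groups to either the training or the test file.
    static List<List<Map<String, String>>> divide(List<Map<String, String>> entries,
                                                  double trainRatio, Random rng) {
        Map<String, List<Map<String, String>>> byDomain = new LinkedHashMap<>();
        for (Map<String, String> e : entries) {
            byDomain.computeIfAbsent(coreDomain(e.get("url")), d -> new ArrayList<>()).add(e);
        }
        List<List<Map<String, String>>> groups = new ArrayList<>(byDomain.values());
        Collections.shuffle(groups, rng);
        List<Map<String, String>> train = new ArrayList<>();
        List<Map<String, String>> test = new ArrayList<>();
        for (List<Map<String, String>> group : groups) {
            // All requests sharing a core domain end up in the same file.
            (train.size() < entries.size() * trainRatio ? train : test).addAll(group);
        }
        return List.of(train, test);
    }

    // Very rough core-domain extraction, e.g. "http://www.one.example.com" -> "example".
    static String coreDomain(String url) {
        String host = url.replaceFirst("^[a-z]+://", "").split("/")[0];
        String[] parts = host.split("\\.");
        return parts.length >= 2 ? parts[parts.length - 2] : host;
    }
}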
Third set of experiments 
3) Enhancing the creation of training and test files. 
Created Rules During Classification 
● In the experiments where the core domain was the only URL-derived classification feature, the learned rules focused too heavily on it.
PART decision list 
------------------ 
url = dropbox: deny (2999.0) 
url = ubuntu: allow (2165.0) 
url = facebook: deny (1808.0) 
url = valli: allow (1679.0) 
Created Rules During Classification 
● Other kinds of rules were also found, but they always depended on the URL core domain.
url = grooveshark AND 
http_method = POST: allow (733.0) 
url = googleapis AND 
content_type = text/javascript AND 
client_address = 192.168.4.4: allow (155.0/2.0) 
url = abc AND 
content_type_MCT = image AND 
time <= 31532000: allow (256.0) 
Fourth set of experiments 
4) Filtering the features of the URL. 
● The rules created by the classifiers are too focused on the URL core domain feature.
● We repeated the experiments with the original file, but included only the Top Level Domain (TLD) of the URL as a feature, instead of the core domain (a sketch of this feature change follows below).
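A rough sketch of the URL feature change in Java; the string handling is simplified (real URLs and multi-part TLDs such as .co.uk would need more care).

class UrlFeatures {
    // "http://www.one.example.com/path" -> "www.one.example.com".
    static String host(String url) {
        return url.replaceFirst("^[a-z]+://", "").split("/")[0];
    }

    // Top Level Domain used as the classification feature, e.g. "com" or "es".
    static String tld(String url) {
        String[] parts = host(url).split("\\.");
        return parts[parts.length - 1];
    }
}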
Fourth set of experiments 
4) Filtering the features of the URL. 
Divisions made over unbalanced data 
Fourth set of experiments 
4) Filtering the features of the URL. 
Divisions made over balanced data 
Created Rules During Classification 
● After replacing the URL core domain with the top level domain as a classification feature, the rules classify mainly by server address.
PART decision list 
------------------ 
server_or_cache_address = 173.194.34.248: allow (238.0/1.0) 
server_or_cache_address = 8.27.153.126: allow (235.0/2.0) 
server_or_cache_address = 91.121.155.13: deny (235.0) 
server_or_cache_address = 66.220.152.19: deny (201.0) 
Created Rules During Classification 
● The URL TLD appears, but the rules no longer always depend on this feature.
server_or_cache_address = 90.84.53.48 AND 
client_address = 10.159.39.199 AND 
tld = es AND 
time <= 31533000: allow (138.0/1.0) 
content_type = application/octet-stream AND 
tld = com AND 
server_or_cache_address = 192.168.4.4 AND 
client_address = 10.159.86.22: allow (210.0) 
server_or_cache_address = 90.84.53.19 AND 
tld = com: deny (33.0/1.0) 
1. Research context. 
2. Underlying problem and objectives. 
3. Data description and preprocessing. 
4. Experimental setup. 
5. Experiments and results. 
6. Conclusions and scientific contributions. 
7. Future Work. 
Index
Conclusions
● In most cases, the Random Forest classifier yields the best results.
● The loss of information when analysing a log of URL requests lowers the results. This happens when:
○ Undersampling the data (because patterns are randomly removed).
○ Keeping the original sequence of requests from the log file when dividing it into training and test files.
Conclusions 
● For future experiments, it should be ensured that the same URL lexical features (such as the core domain) do not appear in both the training and the test files at the same time.
○ Otherwise the results are biased.
● As seen in the rules obtained, it is possible to develop a tool that automatically makes an allow or deny decision for URLs, basing that decision on other features of the request and not only on the URL itself.
Scientific Contributions 
● MUSES: A corporate user-centric system which applies 
computational intelligence methods, at the ACM SAC conference,
Gyeongju, Korea, March 2014. 
● Enforcing Corporate Security Policies via Computational 
Intelligence Techniques, at SecDef Workshop at GECCO, 
Vancouver, July 2014. 
● Going a Step Beyond the Black and White Lists for URL Accesses 
in the Enterprise by means of Categorical Classifiers, at ECTA, 
Rome, Italy, October 2014. 
1. Research context. 
2. Underlying problem and objectives. 
3. Data description and preprocessing. 
4. Experimental setup. 
5. Experiments and results. 
6. Conclusions and scientific contributions. 
7. Future Work. 
Index
Future Work
● Running experiments with bigger data sets (e.g. a whole workday).
● Including more lexical features of the URL in the experiments (e.g. number of subdomains, number of arguments, or the path).
● Considering sessions when classifying.
○ A session is defined as the set of requests made by a certain client during a certain time window.
● Finally, implementing the system and testing it with real data, in real time.
Thank you for your attention 
Questions? 
paloma@geneura.ugr.es 
Twitter @unintendedbear
