Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers 
Authors: 
Antonio Miguel Mora García 
Paloma de las Cuevas Delgado 
Juan Julián Merelo Guervós 
ECTA 2014, Rome, Italy
MUSES is an EU-funded research project 
Bring Your Own Device 
What happens to corporate assets in a BYOD 
environment? 
Structure of the MUSES server 
Underlying Problem 
Enterprise Security applied to employees’ connections to the 
Internet (URL requests). 
● Proxies 
● Firewalls 
● Corporate Security Policies (CSP), which may 
include blacklists and whitelists
What do Black and White lists cover? 
● Every URL in a blacklist is denied; any other URL is allowed. 
What if a URL is allowed by default but should not be? 
● Every URL in a whitelist is allowed; any other URL is denied. 
What if a URL is denied by default but should not be? 
Therefore, we want to go a step beyond. 
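The two list semantics above can be sketched as follows. This is only an illustration: the host names and list contents are hypothetical, and real proxies match on more than the host.

```python
def blacklist_decision(host, blacklist):
    """Blacklist semantics: deny listed hosts, allow everything else."""
    return "DENY" if host in blacklist else "ALLOW"


def whitelist_decision(host, whitelist):
    """Whitelist semantics: allow listed hosts, deny everything else."""
    return "ALLOW" if host in whitelist else "DENY"


# Hypothetical lists and a host that appears in neither of them.
BLACKLIST = {"p2p.example.com"}
WHITELIST = {"intranet.example.com"}
unknown_host = "new-site.example.org"

by_blacklist = blacklist_decision(unknown_host, BLACKLIST)
by_whitelist = whitelist_decision(unknown_host, WHITELIST)
print(by_blacklist, by_whitelist)
```

The same unlisted host gets opposite decisions depending on which list is in force, which is the gap a classifier is meant to fill.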
Objectives 
● Objective → to obtain a tool that automatically makes an 
allow or deny decision for URLs that are 
not included in the black/whitelists. 
o This decision would be based on the decisions made for similar URL 
accesses (those with similar features). 
o The tool should consider other parameters of the request in 
addition to the URL string. 
Followed Schema 
[Diagram: unlabelled requests → labelling process (data mining) → labelled requests → classification methods (machine learning) → classification accuracies and rules → analysis of results] 
Working Scenario 
Employees requesting access to URLs (records from an actual 
Spanish company with around 100 employees) from 8 to 10 am. 
● Log file of 100k entries (patterns), in CSV format. 
● A set of rules (the security policies specified 
as if-then clauses).
Data description: Entries in the Log 
● An entry (unlabelled): 
field                     value 
http_reply_code           200 
http_method               GET 
duration_milliseconds     1114 
content_type              application/octet-stream 
server_or_cache_address   X.X.X.X 
time                      08:30:08 
squid_hierarchy           DEFAULT_PARENT 
bytes                     106961 
url                       http://www.one.example.com 
client_address            X.X.X.X 
● Each entry has 7 categorical fields and 3 numerical fields. 
● This calls for classifiers that support both types: 
o Rule-based classifiers 
o Tree-based classifiers 
Data description: Policies and Rules 
● A policy and the rule that implements it: 
“Video streaming cannot be played” 
rule "policy-1 MP4" 
attributes 
when 
squid:Squid(dif_MCT=="video",bytes>1000000, 
content_type matches "*.application.*", 
url matches "*.p2p.*") 
then 
PolicyDecisionPoint.deny(); 
end 
● A rule has a set of conditions and a decision (ALLOW/DENY). 
● Each condition has a data type, a relationship, and a value (e.g. bytes, >, 1000000). 
Labelling Process 
● The two data sets are compared during the labelling process. 
● Conditions of each rule are checked in each entry/request. 
● If an entry meets all conditions, it is labelled with the 
corresponding decision of the rule. 
● Conflict resolution: when an entry meets the conditions of a rule that 
allows the request AND the conditions of a rule that denies it, 
DENY is chosen. 
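The labelling step can be sketched as below. This is a minimal illustration, not the authors' implementation: the two rules, the relationship names, and the use of Python regular expressions for "matches" are assumptions made for the example.

```python
import operator
import re

# Relationship name -> comparison function (illustrative names).
RELATIONS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    "matches": lambda value, pattern: re.search(pattern, value) is not None,
}

# Hypothetical rules: a list of (field, relationship, value) conditions
# plus the decision taken when all conditions hold.
RULES = [
    {"conditions": [("content_type", "matches", r"video"),
                    ("bytes", ">", 1000000)],
     "decision": "DENY"},
    {"conditions": [("http_method", "==", "GET")],
     "decision": "ALLOW"},
]


def rule_matches(rule, entry):
    """True when the entry meets every condition of the rule."""
    return all(RELATIONS[rel](entry[field], value)
               for field, rel, value in rule["conditions"])


def label(entry, rules):
    """Return DENY, ALLOW, or None when no rule covers the entry."""
    decisions = {r["decision"] for r in rules if rule_matches(r, entry)}
    if "DENY" in decisions:  # DENY wins when rules conflict
        return "DENY"
    if "ALLOW" in decisions:
        return "ALLOW"
    return None


video = {"content_type": "video/mp4", "bytes": 2000000, "http_method": "GET"}
page = {"content_type": "text/html", "bytes": 500, "http_method": "GET"}
post = {"content_type": "text/html", "bytes": 500, "http_method": "POST"}
labels = [label(e, RULES) for e in (video, page, post)]
print(labels)
```

The first entry matches both rules and is labelled DENY; the last one matches neither and stays unlabelled, like the log entries that the rules did not cover.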
Data Summary 
● The CSV file, now restricted to the patterns that could be labelled 
(the others were not covered by the rules), has 57502 
entries/patterns: 
o 38972 with an ALLOW label. 
o 18530 with a DENY label. 
o This is roughly a 2:1 ratio (ALLOW:DENY). 
● Application of data balancing techniques: 
o Undersampling: random removal of patterns in majority class. 
o Oversampling: duplication of each pattern in minority class. 
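A minimal sketch of the two balancing techniques, assuming each pattern is an opaque Python object and that "duplication" means adding one extra copy of every minority pattern. The function names and the fixed seed are choices made for this example.

```python
import random


def undersample(majority, minority, seed=0):
    """Randomly drop majority patterns until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)


def oversample(majority, minority):
    """Add one duplicate of every minority pattern."""
    return list(majority) + list(minority) + list(minority)


# Class sizes taken from the data summary above.
allow = [("ALLOW", i) for i in range(38972)]
deny = [("DENY", i) for i in range(18530)]

under = undersample(allow, deny)
over = oversample(allow, deny)
print(len(under), len(over))
```

With the class sizes from the slide, undersampling leaves 18530 patterns per class, while oversampling yields 37060 DENY patterns against 38972 ALLOW patterns.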
Experimental Setup 
● First, the classifiers are tested with a 10-fold cross-validation 
process. 
o The five most accurate classifiers are chosen for the following 
experiments. 
o The Naïve Bayes classifier is also taken as a reference. 
● Second, the initial (labelled) log file is divided 
into training and test files. 
● These training and test files are created with different ratios, 
taking the entries either randomly or sequentially. 
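The two ways of dividing the log can be sketched as follows. The function name, the `mode` parameter, and the fixed seed are assumptions for illustration.

```python
import random


def split(entries, train_ratio, mode="sequential", seed=0):
    """Split entries into (training, test) lists.

    mode="sequential" keeps the original log order;
    mode="random" shuffles a copy of the entries first.
    """
    entries = list(entries)
    if mode == "random":
        random.Random(seed).shuffle(entries)
    cut = round(len(entries) * train_ratio)
    return entries[:cut], entries[cut:]


log = list(range(10))  # stand-in for ten labelled log entries
seq_train, seq_test = split(log, 0.8)
rnd_train, rnd_test = split(log, 0.8, mode="random")
print(seq_train, seq_test)
```

A sequential split tests on the latest part of the log only, whereas a random split mixes requests from the whole time window into both files.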
Flow Diagram 
1) Initial labelling process. 
Experiments with unbalanced and balanced 
data. From those, divisions are made: 
● 80% training 20% testing 
● 90% training 10% testing 
Randomly and sequentially. 
2) Removal of duplicated requests. 
Experiments with unbalanced data. From 
those, divisions are made: 
● 80% training 20% testing 
● 90% training 10% testing 
● 60% training 40% testing 
Randomly and sequentially. 
3) Enhancing the creation of training and test files. 
Experiments with unbalanced data. From those, divisions 
are made, patterns randomly taken: 
● 80% training 20% testing 
● 90% training 10% testing 
● 60% training 40% testing 
4) Filtering the features of the URL. 
Experiments with unbalanced and balanced data. 
From those, divisions are made, patterns 
randomly taken: 
● 80% training 20% testing 
● 90% training 10% testing 
● 60% training 40% testing 
10-fold cross-validation experiments 
1) Initial labelling process. 
● First, the classifiers are tested with a 10-fold cross-validation process 
over the balanced data. 
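For reference, a k-fold partition can be sketched in plain Python as below. This only shows how the folds are formed; it assumes the patterns have already been shuffled, and the helper name is hypothetical.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every index from 0 to n-1 lands in exactly one test fold.
    """
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
tested = sorted(i for _, test in folds for i in test)
print(len(folds), [len(test) for _, test in folds])
```

Each classifier is trained ten times, each time on nine folds and tested on the remaining one, and the reported accuracy is the average over the ten runs.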
Using separate training/test files 
1) Initial labelling process. 
● Naïve Bayes and the top five classifiers are tested with separate training 
and test files, so that no pattern used for testing is also used for 
training, and vice versa. 
Serendipity rocks 
1) Initial labelling process. 
Divisions made over unbalanced data 
Results continue falling 
1) Initial labelling process. 
Divisions made over balanced data (undersampling) 
Results continue falling 
1) Initial labelling process. 
Divisions made over balanced data (oversampling) 
Why are accuracies still high? 
2) Removal of duplicated requests. 
● We studied the field squid_hierarchy and saw that it had two possible 
values: DIRECT or DEFAULT_PARENT. 
field                     value 
http_reply_code           200 
http_method               GET 
duration_milliseconds     1114 
content_type              application/octet-stream 
server_or_cache_address   X.X.X.X 
time                      08:30:08 
squid_hierarchy           DEFAULT_PARENT 
bytes                     106961 
url                       http://www.one.example.com 
client_address            X.X.X.X 
Repeated entries affect accuracies 
2) Removal of duplicated requests. 
● Connections are made first to the Squid proxy and then, if 
appropriate, the request continues to another server. 
o As a result, some of the entries were repeated, which may affect 
the results. 
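Removing such repeated entries could look like the sketch below. The slides do not state which fields define a duplicate, so the criterion here is an assumption: two entries count as the same request when they agree on every field except `squid_hierarchy` and `server_or_cache_address`.

```python
def remove_duplicates(entries,
                      ignore=("squid_hierarchy", "server_or_cache_address")):
    """Keep the first entry of every group that differs only in ignored fields."""
    seen = set()
    unique = []
    for entry in entries:
        # Hashable key built from all fields that are not ignored.
        key = tuple(sorted((k, v) for k, v in entry.items() if k not in ignore))
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Hypothetical entries: the first two describe the same request, logged once
# per hop; the third is a different request.
entries = [
    {"url": "http://www.one.example.com", "time": "08:30:08", "bytes": 106961,
     "squid_hierarchy": "DEFAULT_PARENT", "server_or_cache_address": "X.X.X.X"},
    {"url": "http://www.one.example.com", "time": "08:30:08", "bytes": 106961,
     "squid_hierarchy": "DIRECT", "server_or_cache_address": "Y.Y.Y.Y"},
    {"url": "http://www.two.example.com", "time": "08:31:00", "bytes": 512,
     "squid_hierarchy": "DIRECT", "server_or_cache_address": "Y.Y.Y.Y"},
]
unique_entries = remove_duplicates(entries)
print(len(unique_entries))
```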
[Diagram labels: “Some local IP”, 192.194.2.2, “Some server IP”, www]
Serendipity rocks again 
2) Removal of duplicated requests. 
Divisions made over unbalanced data 
Where are the URL features going? 
3) Enhancing the creation of training and test files. 
● Repeated URL core domains could lead to misleading results. 
● During the division process, we ensured that requests with the same 
URL core domain went to the same file (either for training or for 
testing). 
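A grouped split of this kind can be sketched as follows. The core-domain extraction is deliberately naive (it takes the label just before the top-level domain, so it mishandles suffixes such as `co.uk`), and the function names and seed are assumptions for the example.

```python
import random
from urllib.parse import urlparse


def core_domain(url):
    """Naive core domain: the host label right before the TLD."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def grouped_split(entries, train_ratio, seed=0):
    """Split so that all entries sharing a core domain end up in one file."""
    groups = {}
    for entry in entries:
        groups.setdefault(core_domain(entry["url"]), []).append(entry)
    domains = sorted(groups)
    random.Random(seed).shuffle(domains)
    target = train_ratio * len(entries)
    train, test = [], []
    for domain in domains:
        # Fill the training file first; the ratio is only approximate
        # because whole groups are moved at once.
        (train if len(train) < target else test).extend(groups[domain])
    return train, test


entries = [
    {"url": "http://www.dropbox.com/a"},
    {"url": "http://dl.dropbox.com/b"},
    {"url": "http://www.ubuntu.com/"},
    {"url": "http://archive.ubuntu.com/x"},
    {"url": "http://www.facebook.com/"},
]
train, test = grouped_split(entries, 0.6)
train_domains = {core_domain(e["url"]) for e in train}
test_domains = {core_domain(e["url"]) for e in test}
print(train_domains, test_domains)
```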
Accuracies fall down automatically 
3) Enhancing the creation of training and test files. 
Created Rules During Classification 
● In the experiments that included only the URL core domain as a 
classification feature, rules were too focused on that feature. 
PART decision list 
------------------ 
url = dropbox: deny (2999.0) 
url = ubuntu: allow (2165.0) 
url = facebook: deny (1808.0) 
url = valli: allow (1679.0) 
Created Rules During Classification 
● Other kinds of rules were found, but they always depended on 
the URL core domain. 
url = grooveshark AND 
http_method = POST: allow (733.0) 
url = googleapis AND 
content_type = text/javascript AND 
client_address = 192.168.4.4: allow (155.0/2.0) 
url = abc AND 
content_type_MCT = image AND 
time <= 31532000: allow (256.0) 
Training with other URL features 
4) Filtering the features of the URL. 
● Rules created by the classifiers are too focused on the URL core domain 
feature. 
● We repeated the experiments with the original file, but including as a 
feature only the Top Level Domain of the URL, and not the core domain. 
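Replacing the URL by its top-level domain could be done as sketched below. The helper names are hypothetical, and the extraction is naive: for a host given as an IP address it would return the last octet.

```python
from urllib.parse import urlparse


def top_level_domain(url):
    """Return the last host label, e.g. 'com' for www.one.example.com."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def replace_url_with_tld(entry):
    """Copy the entry, dropping 'url' and adding a 'tld' feature."""
    features = {k: v for k, v in entry.items() if k != "url"}
    features["tld"] = top_level_domain(entry["url"])
    return features


entry = {"url": "http://www.one.example.com", "http_method": "GET"}
filtered = replace_url_with_tld(entry)
print(filtered)
```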
Random Forest defeats everyone 
4) Filtering the features of the URL. 
Divisions made over balanced data 
Created Rules During Classification 
● After including the URL top level domain as a classification feature, 
instead of URL core domain, rules classify mainly by server 
address. 
PART decision list 
------------------ 
server_or_cache_address = 173.194.34.248: allow (238.0/1.0) 
server_or_cache_address = 91.121.155.13: deny (235.0) 
server_or_cache_address = 90.84.53.48 AND 
client_address = 10.159.39.199 AND 
tld = es AND 
time <= 31533000: allow (138.0/1.0) 
Created Rules During Classification 
● The URL TLD appears, but the rules no longer always 
depend on this feature. 
server_or_cache_address = 90.84.53.19 AND 
tld = com: deny (33.0/1.0) 
server_or_cache_address = 87.248.20.254 AND 
content_type_MCT = image AND 
duration_milliseconds > 21: deny (15.0) 
server_or_cache_address = 23.38.17.224 AND 
time > 30532000 AND 
http_reply_code = 200 AND 
content_type_MCT = image AND 
bytes <= 520 AND 
time <= 33677000: allow (40.0) 
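A decision list such as the ones shown is read top to bottom, and the first rule whose conditions all hold decides the class. The sketch below encodes two of the rules above by hand; the default class used when no rule fires is an assumption, since the slides do not show the list's final default rule.

```python
# Each rule: (list of condition functions, decision). Order matters.
DECISION_LIST = [
    ([lambda e: e["server_or_cache_address"] == "90.84.53.19",
      lambda e: e["tld"] == "com"],
     "deny"),
    ([lambda e: e["server_or_cache_address"] == "87.248.20.254",
      lambda e: e["content_type_MCT"] == "image",
      lambda e: e["duration_milliseconds"] > 21],
     "deny"),
]


def classify(entry, decision_list, default="allow"):
    """Return the decision of the first rule whose conditions all hold."""
    for conditions, decision in decision_list:
        if all(condition(entry) for condition in conditions):
            return decision
    return default  # assumed default class, not taken from the slides


first = {"server_or_cache_address": "90.84.53.19", "tld": "com"}
second = {"server_or_cache_address": "87.248.20.254",
          "content_type_MCT": "image", "duration_milliseconds": 30}
other = {"server_or_cache_address": "10.0.0.1", "tld": "es"}
decisions = [classify(e, DECISION_LIST) for e in (first, second, other)]
print(decisions)
```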
Conclusions 
● In most cases, the Random Forest classifier is the one that yields 
the best results. 
● Losing information when analysing a log of URL 
requests lowers the accuracy. This happens when: 
o Undersampling data (because patterns are randomly removed). 
o Keeping the sequence of the requests of the initial log file while 
dividing it into training and test files. 
Conclusions 
● As seen in the rules obtained, it is possible to develop a tool 
that automatically makes an allowance or denial decision 
with respect to URLs, and that decision would depend on 
other features of a URL request and not only the URL. 
Future Work 
● Run experiments with larger data sets (e.g. a whole 
workday). 
● Include more lexical features of a URL in the experiments 
(e.g. number of subdomains, number of arguments, or the 
path). 
● Consider sessions when classifying. 
o A session is defined as the set of requests made from a certain 
client during a certain time. 
● Finally, implement a system and test it with real 
data, in real time. 
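The lexical URL features named in the future work could be extracted as sketched below. The feature names and the way subdomains are counted (host labels beyond the last two) are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlparse


def lexical_features(url):
    """Extract a few lexical features from a URL string."""
    parts = urlparse(url)
    labels = (parts.hostname or "").split(".")
    return {
        # Labels left of the core domain and TLD, e.g. 'www' and 'one'.
        "num_subdomains": max(len(labels) - 2, 0),
        # Number of key=value pairs in the query string.
        "num_arguments": len(parse_qsl(parts.query, keep_blank_values=True)),
        "path": parts.path,
    }


features = lexical_features("http://www.one.example.com/a/b?x=1&y=2")
print(features)
```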
Thank you for your attention 
Questions? 
amorag@geneura.ugr.es 
jmerelo@geneura.ugr.es 
paloma@geneura.ugr.es 
Twitter (@amoragar, @jjmerelo, 
@unintendedbear)

More Related Content

Viewers also liked

MyBatis como alternativa a Hibernate
MyBatis como alternativa a HibernateMyBatis como alternativa a Hibernate
MyBatis como alternativa a HibernateRubén Aguilera
 
Mejora tus retrospectivas (codemotion 2014)
Mejora tus retrospectivas (codemotion 2014)Mejora tus retrospectivas (codemotion 2014)
Mejora tus retrospectivas (codemotion 2014)Juanma Gómez
 
¿Cómo elegir el languaje y el framework de tu próxima aplicación web?
¿Cómo elegir el languaje y el framework de tu próxima aplicación web?¿Cómo elegir el languaje y el framework de tu próxima aplicación web?
¿Cómo elegir el languaje y el framework de tu próxima aplicación web?Antonio Ognio
 
Ux2012 - Patrones de Interfaz (by Jennifer Tidwell)
Ux2012 - Patrones de Interfaz (by Jennifer Tidwell)Ux2012 - Patrones de Interfaz (by Jennifer Tidwell)
Ux2012 - Patrones de Interfaz (by Jennifer Tidwell)Victor Pezzetti
 
NoSQL? No, SQL! - SQL, the underestimated "Big Data" technology
NoSQL? No, SQL! - SQL, the underestimated "Big Data" technologyNoSQL? No, SQL! - SQL, the underestimated "Big Data" technology
NoSQL? No, SQL! - SQL, the underestimated "Big Data" technologyDataGeekery
 
Stateless token-based authentication for pure front-end applications
Stateless token-based authentication for pure front-end applicationsStateless token-based authentication for pure front-end applications
Stateless token-based authentication for pure front-end applicationsAlvaro Sanchez-Mariscal
 
Scrum bad smells (codemotion 2014)
Scrum bad smells (codemotion 2014)Scrum bad smells (codemotion 2014)
Scrum bad smells (codemotion 2014)Juanma Gómez
 
#PlatziConf - El camino para ser un Pro en JavaScript
#PlatziConf - El camino para ser un Pro en JavaScript#PlatziConf - El camino para ser un Pro en JavaScript
#PlatziConf - El camino para ser un Pro en JavaScriptCarlos Azaustre
 
Game of Frameworks - GDG Cáceres #CodeCC
Game of Frameworks - GDG Cáceres #CodeCCGame of Frameworks - GDG Cáceres #CodeCC
Game of Frameworks - GDG Cáceres #CodeCCCarlos Azaustre
 

Viewers also liked (12)

MyBatis como alternativa a Hibernate
MyBatis como alternativa a HibernateMyBatis como alternativa a Hibernate
MyBatis como alternativa a Hibernate
 
Mejora tus retrospectivas (codemotion 2014)
Mejora tus retrospectivas (codemotion 2014)Mejora tus retrospectivas (codemotion 2014)
Mejora tus retrospectivas (codemotion 2014)
 
¿Cómo elegir el languaje y el framework de tu próxima aplicación web?
¿Cómo elegir el languaje y el framework de tu próxima aplicación web?¿Cómo elegir el languaje y el framework de tu próxima aplicación web?
¿Cómo elegir el languaje y el framework de tu próxima aplicación web?
 
Erlang y elixir
Erlang y elixirErlang y elixir
Erlang y elixir
 
Ux2012 - Patrones de Interfaz (by Jennifer Tidwell)
Ux2012 - Patrones de Interfaz (by Jennifer Tidwell)Ux2012 - Patrones de Interfaz (by Jennifer Tidwell)
Ux2012 - Patrones de Interfaz (by Jennifer Tidwell)
 
Delegation
DelegationDelegation
Delegation
 
Interface
InterfaceInterface
Interface
 
NoSQL? No, SQL! - SQL, the underestimated "Big Data" technology
NoSQL? No, SQL! - SQL, the underestimated "Big Data" technologyNoSQL? No, SQL! - SQL, the underestimated "Big Data" technology
NoSQL? No, SQL! - SQL, the underestimated "Big Data" technology
 
Stateless token-based authentication for pure front-end applications
Stateless token-based authentication for pure front-end applicationsStateless token-based authentication for pure front-end applications
Stateless token-based authentication for pure front-end applications
 
Scrum bad smells (codemotion 2014)
Scrum bad smells (codemotion 2014)Scrum bad smells (codemotion 2014)
Scrum bad smells (codemotion 2014)
 
#PlatziConf - El camino para ser un Pro en JavaScript
#PlatziConf - El camino para ser un Pro en JavaScript#PlatziConf - El camino para ser un Pro en JavaScript
#PlatziConf - El camino para ser un Pro en JavaScript
 
Game of Frameworks - GDG Cáceres #CodeCC
Game of Frameworks - GDG Cáceres #CodeCCGame of Frameworks - GDG Cáceres #CodeCC
Game of Frameworks - GDG Cáceres #CodeCC
 

Similar to Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers

Rules Programming tutorial
Rules Programming tutorialRules Programming tutorial
Rules Programming tutorialSrinath Perera
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationMongoDB
 
A Test Automation Framework
A Test Automation FrameworkA Test Automation Framework
A Test Automation FrameworkGregory Solovey
 
Qtp manual testing tutorials by QuontraSolutions
Qtp manual testing tutorials by QuontraSolutionsQtp manual testing tutorials by QuontraSolutions
Qtp manual testing tutorials by QuontraSolutionsQUONTRASOLUTIONS
 
Secrets of highly_avail_oltp_archs
Secrets of highly_avail_oltp_archsSecrets of highly_avail_oltp_archs
Secrets of highly_avail_oltp_archsTarik Essawi
 
How Manual Testers Can Break into Automation Without Programming Skills
How Manual Testers Can Break into Automation Without Programming SkillsHow Manual Testers Can Break into Automation Without Programming Skills
How Manual Testers Can Break into Automation Without Programming SkillsRanorex
 
Droolsand Rule Based Systems 2008 Srping
Droolsand Rule Based Systems 2008 SrpingDroolsand Rule Based Systems 2008 Srping
Droolsand Rule Based Systems 2008 SrpingSrinath Perera
 
Test Driven Development with Sql Server
Test Driven Development with Sql ServerTest Driven Development with Sql Server
Test Driven Development with Sql ServerDavid P. Moore
 
Automated Testing with Databases
Automated Testing with DatabasesAutomated Testing with Databases
Automated Testing with Databaseselliando dias
 
Model Based Test Validation and Oracles for Data Acquisition Systems
Model Based Test Validation and Oracles for Data Acquisition SystemsModel Based Test Validation and Oracles for Data Acquisition Systems
Model Based Test Validation and Oracles for Data Acquisition SystemsLionel Briand
 
Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ...
Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ...Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ...
Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ...Informatik Aktuell
 
Testing insights from data lakes
Testing insights from data lakesTesting insights from data lakes
Testing insights from data lakesshivindkaur
 
Postgresql in Education
Postgresql in EducationPostgresql in Education
Postgresql in Educationdostatni
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Omid Vahdaty
 
POUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love youPOUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love youJacek Gebal
 
Performance Tuning and Optimization
Performance Tuning and OptimizationPerformance Tuning and Optimization
Performance Tuning and OptimizationMongoDB
 
EKON 23 Code_review_checklist
EKON 23 Code_review_checklistEKON 23 Code_review_checklist
EKON 23 Code_review_checklistMax Kleiner
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Omid Vahdaty
 
The Core of Testing – Dynamic Testing Process – According to ISO 29119 with...
The Core of Testing  – Dynamic Testing Process –  According to ISO 29119 with...The Core of Testing  – Dynamic Testing Process –  According to ISO 29119 with...
The Core of Testing – Dynamic Testing Process – According to ISO 29119 with...TEST Huddle
 

Similar to Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers (20)

Rules Programming tutorial
Rules Programming tutorialRules Programming tutorial
Rules Programming tutorial
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + Optimization
 
A Test Automation Framework
A Test Automation FrameworkA Test Automation Framework
A Test Automation Framework
 
Qtp manual testing tutorials by QuontraSolutions
Qtp manual testing tutorials by QuontraSolutionsQtp manual testing tutorials by QuontraSolutions
Qtp manual testing tutorials by QuontraSolutions
 
Secrets of highly_avail_oltp_archs
Secrets of highly_avail_oltp_archsSecrets of highly_avail_oltp_archs
Secrets of highly_avail_oltp_archs
 
How Manual Testers Can Break into Automation Without Programming Skills
How Manual Testers Can Break into Automation Without Programming SkillsHow Manual Testers Can Break into Automation Without Programming Skills
How Manual Testers Can Break into Automation Without Programming Skills
 
Droolsand Rule Based Systems 2008 Srping
Droolsand Rule Based Systems 2008 SrpingDroolsand Rule Based Systems 2008 Srping
Droolsand Rule Based Systems 2008 Srping
 
Test Driven Development with Sql Server
Test Driven Development with Sql ServerTest Driven Development with Sql Server
Test Driven Development with Sql Server
 
Automated Testing with Databases
Automated Testing with DatabasesAutomated Testing with Databases
Automated Testing with Databases
 
Model Based Test Validation and Oracles for Data Acquisition Systems
Model Based Test Validation and Oracles for Data Acquisition SystemsModel Based Test Validation and Oracles for Data Acquisition Systems
Model Based Test Validation and Oracles for Data Acquisition Systems
 
Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ...
Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ...Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ...
Wolfgang Epting – IT-Tage 2015 – Testdaten – versteckte Geschäftschance oder ...
 
Testing insights from data lakes
Testing insights from data lakesTesting insights from data lakes
Testing insights from data lakes
 
Postgresql in Education
Postgresql in EducationPostgresql in Education
Postgresql in Education
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
POUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love youPOUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love you
 
Performance Tuning and Optimization
Performance Tuning and OptimizationPerformance Tuning and Optimization
Performance Tuning and Optimization
 
EKON 23 Code_review_checklist
EKON 23 Code_review_checklistEKON 23 Code_review_checklist
EKON 23 Code_review_checklist
 
HW03 (1).pdf
HW03 (1).pdfHW03 (1).pdf
HW03 (1).pdf
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
The Core of Testing – Dynamic Testing Process – According to ISO 29119 with...
The Core of Testing  – Dynamic Testing Process –  According to ISO 29119 with...The Core of Testing  – Dynamic Testing Process –  According to ISO 29119 with...
The Core of Testing – Dynamic Testing Process – According to ISO 29119 with...
 

Recently uploaded

Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 

Recently uploaded (20)

Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 

Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers

  • 1. Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers. Authors: Antonio Miguel Mora García, Paloma de las Cuevas Delgado, Juan Julián Merelo Guervós. ECTA 2014, Rome, Italy
  • 2. MUSES is an EU-funded research project
  • 3. Bring Your Own Device: What happens to corporate assets in a BYOD environment?
  • 4. Structure of the MUSES server
  • 5. Underlying Problem: Enterprise security applied to employees’ connections to the Internet (URL requests). ● Proxies ● Firewalls ● Corporate Security Policies (CSP), which may include blacklists and whitelists
  • 6. What do Black and White lists cover? ● With a blacklist, every listed URL is denied and any other URL is allowed. What if something is allowed by default but should not be? ● With a whitelist, every listed URL is allowed and any other URL is denied. What if something is denied by default but should not be? Therefore, we want to go a step beyond.
  • 7. Objectives ● Objective → to obtain a tool that automatically makes an allow or deny decision for URLs that are not included in the black/whitelists. o This decision would be based on the one made for similar URL accesses (those with similar features). o The tool should consider other parameters of the request in addition to the URL string.
  • 8. Followed Schema (diagram labels): Unlabelled requests; Labelling Process; Labelled requests; Data Mining; Machine Learning; Classification methods; Classification accuracies and Rules; Analysis of results.
  • 9. Working Scenario: Employees requesting access to URLs (records from an actual Spanish company with around 100 employees) from 8 to 10 am. ● Log file of 100k entries (patterns), in CSV format. ● A set of rules (the security policies specified as if-then clauses).
  • 10. Data description: Entries in the Log ● An entry (unlabelled) has 7 categorical fields and 3 numerical fields. ● This leads to classifiers that support both types: o Rule-based classifiers o Tree-based classifiers. Example entry: http_reply_code = 200; http_method = GET; duration_milliseconds = 1114; content_type = application/octet-stream; server_or_cache_address = X.X.X.X; time = 08:30:08; squid_hierarchy = DEFAULT_PARENT; bytes = 106961; url = http://www.one.example.com; client_address = X.X.X.X
  • 11. Data description: Policies and Rules ● A policy and its rule. Policy: “Video streamings cannot be reproduced”. Rule: rule "policy-1 MP4" attributes when squid:Squid(dif_MCT=="video", bytes>1000000, content_type matches "*.application.*", url matches "*.p2p.*") then PolicyDecisionPoint.deny(); end ● A rule has a set of conditions and a decision (ALLOW/DENY). ● Each condition has: Data Type, Relationship, Value.
  • 12. Labelling Process ● The two data sets (log entries and rules) are compared during the labelling process. ● The conditions of each rule are checked against each entry/request. ● If an entry meets all the conditions of a rule, it is labelled with that rule's decision. ● Conflict resolution: when an entry meets the conditions of a rule that allows the request AND the conditions of a rule that denies it, DENY is chosen.
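
The labelling step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the rule representation (a list of predicate functions plus a decision), the function name `label_entry`, and the two example rules are all assumptions made for the sketch.

```python
from typing import Callable, Optional

# Hypothetical representation: a condition is a predicate over one log entry.
Condition = Callable[[dict], bool]

def label_entry(entry: dict, rules: list[tuple[list[Condition], str]]) -> Optional[str]:
    """Return "DENY", "ALLOW", or None when no rule covers the entry."""
    # Collect the decision of every rule whose conditions are ALL met.
    decisions = {
        decision
        for conditions, decision in rules
        if all(cond(entry) for cond in conditions)
    }
    if "DENY" in decisions:  # DENY wins when an ALLOW rule and a DENY rule both match
        return "DENY"
    if "ALLOW" in decisions:
        return "ALLOW"
    return None  # entry stays unlabelled

# Illustrative rules only; they are not taken from the real policy set.
rules = [
    ([lambda e: e["content_type"].startswith("video/"),
      lambda e: e["bytes"] > 1_000_000], "DENY"),
    ([lambda e: e["http_method"] == "GET"], "ALLOW"),
]

entry = {"http_method": "GET", "content_type": "video/mp4", "bytes": 2_000_000}
print(label_entry(entry, rules))
```

Entries for which the function returns `None` correspond to the requests that the rules do not cover and that are therefore left out of the labelled file.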
  • 13. Data Summary ● The CSV file, now containing only the patterns that could be labelled (the others were not covered by the rules), has 57502 entries/patterns: o 38972 with an ALLOW label. o 18530 with a DENY label. This is roughly a 2:1 ratio. ● Application of data balancing techniques: o Undersampling: random removal of patterns in the majority class. o Oversampling: duplication of each pattern in the minority class.
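
The two balancing techniques named on this slide might look like the sketch below. It uses toy data with the same 2:1 shape as the labelled log; the function names, the `seed` and `copies` parameters, and the choice to undersample down to exactly the minority size are assumptions, since the slide does not give those details.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly drop majority patterns until both classes have the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, copies=2):
    """Duplicate every minority pattern so that it appears `copies` times."""
    return list(majority), [p for p in minority for _ in range(copies)]

# Toy stand-ins for the ALLOW (majority) and DENY (minority) patterns.
allow = [("ALLOW", i) for i in range(20)]
deny = [("DENY", i) for i in range(10)]

u_allow, u_deny = undersample(allow, deny)
o_allow, o_deny = oversample(allow, deny)
print(len(u_allow), len(u_deny), len(o_allow), len(o_deny))
```

Undersampling discards information from the majority class, while oversampling keeps every original pattern but introduces exact duplicates.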
  • 14. Experimental Setup ● First, the classifiers are tested with a 10-fold cross-validation process. o The top five classifiers by accuracy are chosen for the following experiments. o The Naïve Bayes classifier is also taken as a reference. ● Second, the initial (labelled) log file is divided into training and test files. ● These training and test files are created with different ratios, taking the entries either randomly or sequentially.
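
As a rough illustration of the division step, the sketch below splits a list of entries by a given ratio, either keeping the log order (sequential) or shuffling first (random). The function name `split`, its parameters, and the fixed seed are hypothetical; the slide only states that both modes and several ratios were used.

```python
import random

def split(entries, train_ratio, sequential, seed=0):
    """Split entries into (train, test); keep the log order when sequential."""
    rows = list(entries)
    if not sequential:
        random.Random(seed).shuffle(rows)  # random division
    cut = round(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

log = list(range(100))  # stand-in for labelled entries in time order

train_seq, test_seq = split(log, 0.8, sequential=True)   # 80% / 20%, sequential
train_rnd, test_rnd = split(log, 0.9, sequential=False)  # 90% / 10%, random
print(len(train_seq), len(test_seq), len(train_rnd), len(test_rnd))
```

In the sequential mode the test file contains only the latest requests of the log, so the classifier is evaluated on traffic that comes after everything it was trained on.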
  • 15. Flow Diagram 1) Initial labelling process. Experiments with unbalanced and balanced data. Divisions made, randomly and sequentially: ● 80% training, 20% testing ● 90% training, 10% testing. 2) Removal of duplicated requests. Experiments with unbalanced data. Divisions made, randomly and sequentially: ● 80% training, 20% testing ● 90% training, 10% testing ● 60% training, 40% testing. 3) Enhancing the creation of training and test files. Experiments with unbalanced data. Divisions made, patterns randomly taken: ● 80% training, 20% testing ● 90% training, 10% testing ● 60% training, 40% testing. 4) Filtering the features of the URL. Experiments with unbalanced and balanced data. Divisions made, patterns randomly taken: ● 80% training, 20% testing ● 90% training, 10% testing ● 60% training, 40% testing.
  • 16. 10-fold cross-validation experiments 1) Initial labelling process. ● The classifiers are tested, firstly, with a 10-fold cross-validation process over the balanced data.
  • 17. Using separate training/test files 1) Initial labelling process. ● Naïve Bayes and the top five classifiers are tested with training and test divisions, in order to avoid testing patterns being used for training and vice versa.
  • 18. Serendipity rocks 1) Initial labelling process. Divisions made over unbalanced data
  • 19. Results continue falling 1) Initial labelling process. Divisions made over balanced data (undersampling)
  • 20. Results continue falling 1) Initial labelling process. Divisions made over balanced data (oversampling)
  • 21. Why are accuracies still high? 2) Removal of duplicated requests. ● We studied the field squid_hierarchy and saw that it had two possible values: DIRECT or DEFAULT_PARENT. Example entry: http_reply_code = 200; http_method = GET; duration_milliseconds = 1114; content_type = application/octet-stream; server_or_cache_address = X.X.X.X; time = 08:30:08; squid_hierarchy = DEFAULT_PARENT; bytes = 106961; url = http://www.one.example.com; client_address = X.X.X.X
  • 22. Repeated entries affect accuracies 2) Removal of duplicated requests. ● Connections are made first to the Squid proxy and then, if appropriate, the request continues to another server. o As a result, some of the entries were repeated, and the results may be affected by that. (Diagram labels: “Some local IP”, 192.194.2.2, “Some server IP”)
  • 23. Serendipity rocks again 2) Removal of duplicated requests. Divisions made over unbalanced data
  • 24. Where are the URL features going? 3) Enhancing the creation of training and test files. ● Repeated URL core domains could lead to misleading results. ● During the division process, we ensured that requests with the same URL core domain went to the same file (either training or testing).
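
One way to keep all requests with the same URL core domain in the same file is to split by groups rather than by individual entries. The sketch below is an assumption about how this could be done, not the procedure used in the experiments: the `core_domain` key, the function name `split_by_domain`, and the greedy fill strategy are all illustrative, and the resulting ratio is only approximate because whole groups are assigned at once.

```python
import random
from collections import defaultdict

def split_by_domain(entries, train_ratio, seed=0):
    """Assign whole core-domain groups to train or test, never splitting a group."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["core_domain"]].append(entry)
    domains = sorted(groups)
    random.Random(seed).shuffle(domains)
    target = train_ratio * len(entries)
    train, test = [], []
    for domain in domains:
        # Fill the training file first; the remaining domains go to testing.
        bucket = train if len(train) < target else test
        bucket.extend(groups[domain])
    return train, test

names = ["dropbox", "ubuntu", "facebook", "valli", "abc"] * 4
entries = [{"core_domain": d, "id": i} for i, d in enumerate(names)]

train, test = split_by_domain(entries, 0.8)
train_domains = {e["core_domain"] for e in train}
test_domains = {e["core_domain"] for e in test}
print(sorted(train_domains), sorted(test_domains))
```

With this kind of split the classifier is always tested on core domains it never saw during training.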
  • 25. Accuracies fall down automatically 3) Enhancing the creation of training and test files.
  • 26. Created Rules During Classification ● In the experiments that included only the URL core domain as a classification feature, rules were too focused on that feature. PART decision list: url = dropbox: deny (2999.0); url = ubuntu: allow (2165.0); url = facebook: deny (1808.0); url = valli: allow (1679.0)
  • 27. Created Rules During Classification ● Other kinds of rules were found, but they always depended on the URL core domain: url = grooveshark AND http_method = POST: allow (733.0); url = googleapis AND content_type = text/javascript AND client_address = 192.168.4.4: allow (155.0/2.0); url = abc AND content_type_MCT = image AND time <= 31532000: allow (256.0)
  • 28. Training with other URL features 4) Filtering the features of the URL. ● Rules created by the classifiers are too focused on the URL core domain feature. ● We repeated the experiments with the original file, but including only the Top Level Domain (TLD) of the URL as a feature, and not the core domain.
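
To make the two URL features concrete, the sketch below extracts a core domain and a TLD from a URL with the standard library. It is a naive illustration under a stated assumption: the TLD is taken to be the last dot-separated label of the host and the core domain the label before it, which would mishandle multi-label suffixes (for example `co.uk`) and hosts given as IP addresses. The function name `url_features` is hypothetical.

```python
from urllib.parse import urlsplit

def url_features(url):
    """Return (core_domain, tld) using a naive split of the host on dots."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        return host, ""  # no dot in the host, so no TLD to report
    return labels[-2], labels[-1]

print(url_features("http://www.one.example.com"))  # ('example', 'com')
```

Using only the second element of the pair as a feature corresponds to the TLD-only experiments described on this slide.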
  • 29. Random Forest defeats everyone 4) Filtering the features of the URL. Divisions made over balanced data
  • 30. Created Rules During Classification ● After including the URL top level domain as a classification feature, instead of the URL core domain, rules classify mainly by server address. PART decision list: server_or_cache_address = 173.194.34.248: allow (238.0/1.0); server_or_cache_address = 91.121.155.13: deny (235.0); server_or_cache_address = 90.84.53.48 AND client_address = 10.159.39.199 AND tld = es AND time <= 31533000: allow (138.0/1.0)
  • 31. Created Rules During Classification ● The URL TLD appears, but now the rules are not always dependent on this feature: server_or_cache_address = 90.84.53.19 AND tld = com: deny (33.0/1.0); server_or_cache_address = 87.248.20.254 AND content_type_MCT = image AND duration_milliseconds > 21: deny (15.0); server_or_cache_address = 23.38.17.224 AND time > 30532000 AND http_reply_code = 200 AND content_type_MCT = image AND bytes <= 520 AND time <= 33677000: allow (40.0)
  • 32. Conclusions ● In most cases, the Random Forest classifier is the one that yields the best results. ● Losing information when analysing a log of URL requests lowers the results. This happens when: o Undersampling data (because we randomly remove data). o Keeping the sequence of the requests of the initial log file while dividing it into training and test files.
  • 33. Conclusions ● As seen in the rules obtained, it is possible to develop a tool that automatically makes an allow or deny decision for URLs, where that decision depends on other features of a URL request and not only on the URL.
  • 34. Future Work ● Run experiments with bigger data sets (e.g. a whole workday). ● Include more lexical features of a URL in the experiments (e.g. number of subdomains, number of arguments, or the path). ● Consider sessions when classifying. o A session is defined as the set of requests made from a certain client during a certain time. ● Finally, implement a system and test it with real data, in real time.
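
The lexical features mentioned as future work could be computed along the lines of the sketch below. The definitions are assumptions, since the slides do not specify them: subdomains are counted as host labels to the left of a two-label domain (for example `example.com`), and arguments as key/value pairs in the query string. The function name `lexical_features` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

def lexical_features(url):
    """Return a few simple lexical features of a URL as a dict."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        # Labels left of the last two (assumes a single-label TLD).
        "num_subdomains": max(len(labels) - 2, 0),
        # Number of key/value pairs in the query string.
        "num_arguments": len(parse_qsl(parts.query, keep_blank_values=True)),
        "path": parts.path,
    }

features = lexical_features("http://www.one.example.com/a/b?x=1&y=2")
print(features)
```

The two counts are numerical and the path is categorical, so features like these would fit the rule-based and tree-based classifiers that already handle both field types.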
  • 35. Thank you for your attention Questions? amorag@geneura.ugr.es jmerelo@geneura.ugr.es paloma@geneura.ugr.es Twitter (@amoragar, @jjmerelo, @unintendedbear)