1. Machine Learning Approach in Web Proxy Cache Replacement
Sivaraj Nimishan
2011/CSC/016
Supervisor: Sriskandarajah Shriparen
2. Web Proxy Caching
• Web proxy caching is a solution for improving the performance of Web-based systems.
3. Cache Replacements
• In proxy cache replacement, the proxy cache must effectively decide which objects are worth caching and which should be replaced with other objects.
• LRU: The least recently used objects are removed first.
• LFU: The least frequently used objects are removed first.
• LFU-DA: Dynamic aging factor is incorporated into LFU.
• GDSF: Size, cost of fetching, and dynamic aging factor integrated with frequency.
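The LRU rule listed above (the least recently used objects are removed first) can be illustrated with a minimal sketch. The class name `LRUCache` and the choice to count capacity in objects rather than bytes are assumptions made for illustration; this is not how Squid implements the policy.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache keyed by URL (hypothetical helper, not Squid code).

    Capacity is counted in objects, not bytes; that is a simplifying assumption.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.store = OrderedDict()  # oldest entry first, newest last

    def get(self, url):
        """Return the cached object, or None on a miss."""
        if url not in self.store:
            return None
        self.store.move_to_end(url)  # mark as most recently used
        return self.store[url]

    def put(self, url, obj):
        """Insert or refresh an object, evicting the least recently used one if full."""
        if url in self.store:
            self.store.move_to_end(url)
        self.store[url] = obj
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # drop the least recently used entry
```

LFU, LFU-DA and GDSF differ only in the key used to pick the victim (frequency, frequency plus an aging factor, or a combination of frequency, size and fetch cost), so the same structure could be adapted with a priority queue.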
4. Squid
Cache replacement policies
• LRU: The LRU policy keeps recently referenced objects.
• heap GDSF: The heap GDSF policy optimizes object hit rate by keeping smaller popular objects in cache.
• heap LFUDA: The heap LFUDA policy keeps popular objects in cache regardless of their size.
• heap LRU: LRU policy implemented using a heap.
Squid log format
• timestamp
• response time
• client address
• status codes
• size
• request method
• URL
• client identity
• hierarchy code
• content type
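A log line with the ten fields listed above can be split into named values as in the sketch below. It assumes the fields are whitespace-separated and appear in the listed order; the names `LogEntry` and `parse_line` are hypothetical, and the hierarchy and content-type values used in the usage note are made up for illustration.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    """One access-log record, in the field order listed on the slide."""
    time: float          # Unix timestamp of the request
    duration: int        # response time
    client: str          # client address
    result_code: str     # status codes, e.g. TCP_MISS/200
    size: int            # size of the reply
    method: str          # request method
    url: str
    ident: str           # client identity ("-" when unknown)
    hierarchy: str       # hierarchy code
    content_type: str


def parse_line(line: str) -> LogEntry:
    """Split one whitespace-separated log line into a LogEntry."""
    parts = line.split()
    if len(parts) != 10:
        raise ValueError(f"expected 10 fields, got {len(parts)}")
    return LogEntry(
        float(parts[0]), int(parts[1]), parts[2], parts[3], int(parts[4]),
        parts[5], parts[6], parts[7], parts[8], parts[9],
    )
```

For example, `parse_line("1335442301.000 379 127.0.0.1 TCP_MISS/200 1609 GET http://www.opencalais.com/robots.txt - DIRECT/192.0.2.1 text/plain")` yields an entry whose `method` is `GET` and whose `size` is `1609`.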
6. Data collection
Billion Triples Challenge 2012 Dataset
The dataset was crawled during May/June 2012. Several seed sets were collected from multiple sources.
• Datahub: A data ecosystem for individuals, teams and people.
• DBpedia: A crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.
• Freebase: A community-curated database of well-known people, places, and things.
• Rest: The seed set for the Rest crawl contained all other URIs involved in a relation in the DBpedia
• Timbl: The Timbl crawl consisted of Tim Berners-Lee's Friend of a Friend (FOAF) project (2 files).
7. Preprocessing
Data Set   Size       From                         To
Datahub    136.8 MB   [Thu Apr 26 20:07:13 2012]   [Fri Apr 27 16:20:16 2012]
DBpedia    170.3 MB   [Tue May 1 07:46:29 2012]    [Fri Apr 27 21:19:02 2012]
Freebase   123.6 MB   [Fri Apr 27 07:18:03 2012]   [Mon Apr 30 12:31:49 2012]
Rest       32 MB      [Mon Apr 30 13:34:06 2012]   [Mon Apr 30 18:46:04 2012]
Timbl 1    138.5 MB   [Sat May 5 21:05:02 2012]    [Tue May 8 07:50:56 2012]
Timbl 2    179.5 MB   [Tue May 15 20:29:22 2012]   [Wed May 23 04:53:27 2012]
9. Preprocessing...
SWL: Sliding Window Length of 30 minutes (Romano and ElAarag)
The target attribute is obtained by a backward-looking sliding window:
\[
\text{Target attribute} =
\begin{cases}
1 & \text{if the object is revisited within the sliding window} \\
0 & \text{otherwise}
\end{cases}
\]
Attributes    Values
time          1335442301
duration      379
client        127.0.0.1
result_code   TCP_MISS/200
size          1609
method        GET
URL           http://www.opencalais.com/robots.txt
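The target attribute defined above could be derived as in the sketch below. It follows one possible reading of the slide: a request is labelled 1 when the same URL was requested within the preceding 30 minutes (backward-looking), and 0 otherwise. The function name `label_requests`, the `(timestamp, url)` input format, and timestamps in seconds are assumptions for illustration.

```python
SWL_SECONDS = 30 * 60  # sliding window length of 30 minutes, in seconds


def label_requests(requests, swl=SWL_SECONDS):
    """Label each request with the target attribute.

    `requests` is an iterable of (timestamp, url) pairs sorted by timestamp
    (an assumed input format). Returns a list of 0/1 labels, one per request:
    1 if the same URL was seen at most `swl` seconds earlier, else 0.
    """
    last_seen = {}  # url -> timestamp of its most recent request
    labels = []
    for timestamp, url in requests:
        previous = last_seen.get(url)
        revisited = previous is not None and timestamp - previous <= swl
        labels.append(1 if revisited else 0)
        last_seen[url] = timestamp
    return labels
```

With this reading, the first request for any URL is always labelled 0, and a request that arrives more than 30 minutes after the previous one for the same URL is labelled 0 as well.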
10. Preprocessing...
A Perl command used to convert the Unix timestamp to a human-readable timestamp:
tail access.log | perl -p -e 's/^([0-9]*)/"[".localtime($1)."]"/e'
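The same conversion can be sketched in Python for use inside a preprocessing script. The function name `humanize` is hypothetical. Like the Perl one-liner, it rewrites only the leading run of digits into a bracketed local-time string, so any fractional part of the timestamp stays after the closing bracket. Unlike the Perl pattern, it leaves lines without a leading digit unchanged.

```python
import re
import time


def humanize(line: str) -> str:
    """Replace a leading Unix timestamp with "[<local time>]"."""
    return re.sub(
        r"^(\d+)",
        lambda match: "[" + time.ctime(int(match.group(1))) + "]",
        line,
        count=1,
    )
```

The printed time depends on the time zone of the machine running the script, as it does with Perl's `localtime`.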
13. Performance Measure
Hit Ratio is the measure most widely used in evaluating the performance of Web caching. It is defined as the percentage of requests that can be satisfied by the cache.
\[
\text{Hit Ratio} = \frac{\text{Number of hits}}{\text{Cacheable requests}} \times 100
\]
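As a quick check of the definition, the hit ratio can be computed from two counts. The function name `hit_ratio` is an assumption for illustration. The figures in the usage note are the Datahub2 counts reported in the results of this deck.

```python
def hit_ratio(hits: int, requests: int) -> float:
    """Return the percentage of requests that were satisfied by the cache."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    if not 0 <= hits <= requests:
        raise ValueError("hits must be between 0 and requests")
    return 100.0 * hits / requests
```

For example, `hit_ratio(45357, 54557)` gives roughly 83.1, in line with the value reported for Datahub2.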
14. Machine Learner
WSO2 Machine Learner is a product that helps to manage and explore data, build machine learning models after analyzing the data with machine learning algorithms, compare and manage the generated models, and predict using the built models.
Apache Spark is a fast and general engine for large-scale data processing.
Easy graphical user interface for human-friendly viewing
15. Access the ML UI from a Web browser using the following URL: https://<ML_HOST>:<ML_PORT>/ml
To run ML: <PRODUCT_HOME>/bin/wso2server.sh
SVM Parameters
Iterations : 100
Learning Rate : 0.001
SGD Data Fraction : 1
Reg Type : L1
Reg Parameter : 0.001
Decision Tree Parameters
Max Depth : 30
Max Bins : depends on unique features
Impurity : gini/entropy
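The two impurity options offered for the Decision Tree can be compared with a small sketch. It uses the standard textbook definitions of Gini impurity and entropy over a list of class labels. The function names are hypothetical, and this is not the code that WSO2 Machine Learner or Spark runs internally.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    if not labels:
        raise ValueError("labels must not be empty")
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over class proportions p."""
    if not labels:
        raise ValueError("labels must not be empty")
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )
```

Both measures are zero for a node holding a single class and reach their maximum for an even split, which for two classes is 0.5 for Gini and 1 bit for entropy.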
17.
Data set   Total requests   Number of hits   Hit ratio (%)
Datahub2   54557            45357            83.13
Dbpedia    181114           105883           58.46
Freebase   43507            32527            74.76
Rest       5685             4428             77.88
Timbl      97039            42390            43.68
Timbl2     206708           135149           66.15

Data set   Total requests   Number of hits   Hit ratio (%)
Datahub2   54557            25470            46.68
Dbpedia    181114           118418           65.38
Freebase   43507            26359            60.58
Rest       5685             1519             26.71
Timbl      97039            58243            60.02
Timbl2     204288           96822            47.39
18. Conclusion
Data Set   Requests   Cacheable requests   Hit Ratio (%)
Datahub    398547     181850               83.13
DBpedia    1382090    537038               65.38
Freebase   333956     145010               74.76
Rest       71972      18942                77.88
Timbl 1    889591     323451               60.02
Timbl 2    1675106    680952               66.15
In this study, SVM and Decision Tree approaches were trained on proxy log files to classify the contents of the Web proxy cache.
The hit ratio was calculated from the classification decisions made by the trained SVM and the trained Decision Tree.
The performance of Web caching can be improved using supervised machine learning.
Classifiers can be utilized to improve the hit ratio of traditional Web caching policies.
19. References
S. Romano and H. ElAarag, "A neural network proxy cache replacement strategy and its implementation in the Squid proxy server", Neural Computing & Applications, Vol. 20, No. 1, (2011), pp. 59-78.
A. I. Vakali, "LRU-based algorithms for Web Cache Replacement".
W. Ali, S. Sulaiman, and N. Ahmad, "Performance Improvement of Least-Recently Used Policy in Web Proxy Cache Replacement Using Supervised Machine Learning", Int. J. Advance. Soft Comput. Appl., Vol. 6, No. 1, (2014).
Introducing Machine Learner, https://docs.wso2.com/display/ML100/Introducing+Machine+Learner
Squid: Optimising Web Delivery, http://www.squid-cache.org/
Editor's Notes
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response.
seed data is a collection of information that is used as training, testing, or as a template
S. Romano and H. ElAarag, "A neural network proxy cache replacement strategy and its implementation in the Squid proxy server" Vol. 20, No. 1, (2011), pp. 59-78.
30 mins: can increase the training performance when using large training datasets.
The idea is to use information about a Web object requested in the past to predict whether that Web object will be revisited within the sliding window.
High Write Load
MongoDB by default prefers high insert rate
handle highly diverse data types, and manage applications more efficiently
representational state transfer (REST) is the software architectural style of the World Wide Web.
DAS: Data Analytics Server
Stochastic gradient descent: a simple optimization method for minimizing an objective function.
Learning rate: the step size in gradient descent.
“Gini” to minimize misclassification
“Entropy” for exploratory analysis
These differ less than 2% of the time
“Gini” will tend to find the largest class, and “entropy” tends to find groups of classes that make up ~50% of the data