Nathanael Asaam
Founder and CEO @ Equicksales Consulting Ltd | Application Support Officer @ Ashesi
University
nataoasaam@gmail.com
An Anomaly Detection System for Ecommerce Sites Hosted on a LAMP Server
Introduction
Anomaly detection identifies deviations from the normal behaviour patterns of a system in order to detect intrusions and threats on a computer system. In this paper, we describe how to build an anomaly detection system for ecommerce sites
that are hosted on a LAMP server. LAMP is an acronym that stands for Linux, Apache, MySQL and PHP.
The motivation for this research paper is that LAMP servers normally come with application logs for
the various software components that make up the LAMP stack. Thus, it is easy to build a machine learning model that
describes the behaviour of these components, or a rule-based model for anomaly detection. This is partly because the
training data can easily be obtained from the logs for Apache, MySQL, PHP, and any Linux distribution. Additionally, the
logs are updated whenever there are interactions with these components, and error logs record errors that occur in them.
In this project, we will describe how to build a model for a LAMP server that uses Ubuntu Server, and use it to
detect anomalous activities on a web application or ecommerce site that is hosted on that LAMP server. It is
essential to note that Ubuntu Server is free and open source, as are Apache, MySQL and PHP.
Background
In this section we give a concise background of Ubuntu Server, Apache, MySQL and PHP.
Ubuntu Server
Ubuntu Server is a version of the Ubuntu Operating System that is designed and engineered as a backbone for
the Internet [2]. Ubuntu Server brings economic and technical scalability to your datacentre, public or private
[2]. Whether you want to deploy an OpenStack cloud, a Kubernetes cluster, or a 50,000-node render farm, it
delivers the best-value scale-out performance available [2].
Apache
Apache, also known as the Apache HTTP Server, is the most widely used web server software and runs on 67% of all
websites in the world [1]. It is developed and maintained by the Apache Software Foundation [1]. It is fast, reliable
and secure, and can be highly customized to meet the needs of many different environments using extensions
and modules [1].
MySQL
MySQL is an open-source Relational Database Management System (RDBMS) that enables users to
store, manage and retrieve structured data efficiently [4]. It is widely used for various applications, from
small-scale projects to large-scale websites and enterprise-level solutions [4]. It is also the most popular
open-source SQL DBMS and is developed, supported and distributed by Oracle Corporation [5].
PHP
PHP (a recursive acronym for PHP: Hypertext Preprocessor) is a widely used general-purpose scripting language
that is well suited for web development and can be embedded in HTML [6]. According to Web Technology
Surveys, PHP is used by 78.1% of all websites, including high-traffic websites such as Facebook and Wikipedia
[7].
Previous Work
In this section we describe several works related to vulnerability scan detection and to detecting intrusions or
attacks by analyzing the logs of the Apache HTTP Server.
Rule-Based Model for Analyzing HTTP Access Logs and Detecting Web Scans, SQL Injection (SQLI) and
Cross-Site Scripting (XSS)
A research paper on using a rule-based model to detect anomalies by analyzing HTTP server access logs and web
scans explains that, according to the European Network and Information Security Agency (ENISA) Threat
Landscape, web-based and web application attacks are ranked as numbers two and three in the cyber security
environment [3]. These rankings remained unchanged between 2014 and 2015 [3]. Thus, web applications are
especially prone to security risks [3].
The research paper states that Cross-Site Scripting (XSS) and Structured Query Language Injection
(SQLI) attacks appeared to decrease in 2014 but increased in 2015 [3]. The paper went further to state that,
to detect all the mentioned attacks and web scans, analyzing log files is preferred because
anomalies in users' requests and the related server responses can be clearly identified [3]. It must also be stated
that two primary reasons why analyzing log files is preferred are that there is no need for expensive hardware
for the analysis, and that log files provide successful detection even for encrypted protocols such as Secure
Sockets Layer (SSL) and the Secure Shell Daemon (SSHD) [3]. However, the paper noted that the heavier the
website traffic, the more difficult the analysis of the log file, which presents the need for a user-friendly web
vulnerability scanner detection tool for analyzing log files [3].
Also, the motivation for that research paper is that other work in this field uses a different approach:
machine learning and data mining based predictive detection of malicious activities [3]. Additionally, in order
to increase the accuracy of a machine learning classifier, large-scale training data is needed, which in
turn leads to increased memory usage [3]. Another drawback of machine learning based approaches
is overfitting: a model that fits the training data too well, resulting in poor
predictive performance and low generalization ability [3].
Finally, the proposed model of this research paper makes three significant assumptions. These are:
1. POST data cannot be logged in access logs. Consequently, the proposed method cannot capture this
sort of data [3].
2. Browsers or web servers may support other encodings. Since only two are in the scope of the
research paper, the script does not capture data encoded in other styles [3].
3. The proposed model is for the detection of two well-known web application attacks and malicious web
vulnerability scans. Thus, the model is not for prevention, and working in online mode is not included in the
research paper [3].
Classification of Malicious Cyber Activities and Attacks and Vulnerability Scans
A research paper on the classification of malicious web sessions states that SANS reported that 60% of total attack
attempts observed on the Internet were against web applications [8]. The paper further states that, despite
the long tradition and great success of characterizing network traffic and server workloads, such characterization
is no longer the focus of research [8]. Also, not much focus is placed on the quantification of malicious attacker
behaviour [8]. One evident reason for this is the lack of publicly available, good-quality data on cyber security
threats and malicious attacker activities [8].
The paper explains that, although there is a significant amount of research on intrusion detection, the
focus is on developing data mining techniques aimed at constructing a black box that classifies network traffic
as malicious or non-malicious, rather than on discovering the nature of malicious activities [8].
Additionally, a significant amount of intrusion detection research was based on outdated data sets such
as the DARPA Intrusion Detection Data Set and its derivative KDD [8]. Motivated by the lack of available data
sets that incorporate attacker activities, the researchers developed and deployed high-interaction honeypots
as a means to collect such data [8]. Their honeypots were configured in a three-tier architecture (consisting of
a front-end web server, an application server and a back-end database) and had meaningful functionality [8].
Furthermore, they ran standard off-the-shelf operating systems and applications which followed typical
security guidelines and did not include user accounts with missing or weak passwords [8]. The data collected by the
honeypots were grouped into four datasets, each spanning four to five months [8]. Also, each dataset
consisted of malicious web sessions extracted from the application-level logs of systems running on the Internet
[8].
The research paper used supervised machine learning methods to automatically classify malicious
web sessions as attacks or vulnerability scans, and each web session was characterized by 43 features
reflecting different session characteristics, such as the number of requests in a session, the number of requests of a
specific method type (GET, POST, OPTIONS), the number of requests to dynamic application files, and the length of
the request substring within a session [8]. In all, the research paper used three supervised machine learning
methods, namely Support Vector Machines (SVM), J48 decision trees, and PART, to classify attacker
activities aimed at web systems [8]. According to the paper, the results show that supervised learning methods
can be used to efficiently distinguish attack sessions from vulnerability scan sessions, with a very high probability
of detection and a low probability of false alarms [8].
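To make the feature-based session characterization above concrete, the sketch below computes a small, illustrative subset of such features from a list of request lines. The feature names and session format here are assumptions for illustration; the cited paper's full set of 43 features is not reproduced.

```python
def session_features(requests):
    """Compute a few per-session features of the kind described above:
    request counts by method type and the longest request string."""
    methods = [r.split()[0] for r in requests]
    return {
        "num_requests": len(requests),
        "num_get": methods.count("GET"),
        "num_post": methods.count("POST"),
        "num_options": methods.count("OPTIONS"),
        "max_request_length": max(len(r) for r in requests),
    }

# A hypothetical two-request web session extracted from an access log.
session = ["GET /index.php HTTP/1.1", "POST /login.php HTTP/1.1"]
print(session_features(session))
```

Feature vectors of this shape, one per session, would then be fed to a supervised learner such as an SVM or a decision tree.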
Finally, it is worth stating that the research paper explored the following three research questions:
1. Can supervised machine learning methods be used to distinguish between web attacks and
vulnerability scans?
2. Do attacks and vulnerability scans differ in a small number of features? If so, are these subsets
of best features consistent across different datasets?
3. Do some learners perform consistently better than others across different datasets?
Security Monitoring of HTTP Traffic Using Extended Flows
A research paper on security monitoring of HTTP traffic using extended flows states that HTTP is currently the
most widely used protocol and accounts for a significant amount of network traffic [9]. The paper further explains
that the most suitable way of gaining an overview of HTTP traffic in a large-scale network is extended network
flow monitoring [9]. According to the research paper, there are two approaches to network traffic monitoring:
Deep Packet Inspection (DPI) and flow monitoring. DPI is resource demanding but provides detailed
information about the whole packet, including its payload [9]. Network flow monitoring is fast but is limited to
layers 3 and 4 of the ISO/OSI model; extended flow monitoring is a synergy of the benefits of both methods
[9]. It adds application-level data to traditional flow records while keeping the ability to monitor
large-scale and high-speed networks [9].
The research paper further explains that the correlation of logs from web servers is an option, but also
states that in large networks it is not always possible to gain access to the logs or even be aware of all of them [9].
Thus, this research is most significant to administrators of large networks, typically academic networks
and those of ISPs [9]. The paper also addresses two problems: a lack of overview of network traffic and
insufficient security awareness [9]. The paper also states that many administrators oversee web servers and
their neighbourhood in their administration, but are not aware of security threats in the rest of the
network [9]. The other problem is to find a suitable set of tools to analyze HTTP traffic and distinguish between
legitimate and malicious traffic [9].
The research paper poses these two research questions:
1. What classes of HTTP traffic relevant to security can be observed at the network level, and what is
their impact on attack detection?
2. What is the added value of extended flows compared to traditional flow monitoring from a
security point of view?
The paper also describes three classes of HTTP traffic, comprising brute-force password attacks,
connections to proxies, and HTTP scanners and web crawlers [9]. Using classification, the paper was able to
detect 16 previously undetectable brute-force password attacks and 19 HTTP scans per day on their campus
network [9]. The activities of proxy servers and web crawlers were also observed [9]. Another result of this
research paper is that four extended-flow attributes were monitored [9]. These are the source IP address,
destination IP address, hostname, and HTTP request [9].
Proposed System Model
This section describes our proposed Anomaly Detection System model for the LAMP server. The proposed
Anomaly Detection System employs three different but simple techniques for log file size monitoring, log file
entries classification, and Markov Model of log file sizes.
Log File Size Monitoring
First of all, we will check the sizes of the log files for Ubuntu Server, Apache, MySQL and PHP, and we will monitor
these log files in real time in order to determine the rate of change of the file sizes
on a day-to-day basis.
Also, we will track log file sizes daily to see whether each new file size is within the expected
threshold based on statistical measures, such as the mean log file size computed over the file sizes for a number
of days and the standard deviation of that data. If during monitoring we see a deviation, we will record it as an
anomaly.
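As a minimal sketch of this technique, the deviation check could be implemented as follows. The three-standard-deviation threshold and the sample sizes are illustrative assumptions, not values fixed by this paper.

```python
import statistics

def detect_size_anomaly(daily_sizes, new_size, k=3.0):
    """Flag new_size as anomalous when it deviates more than k standard
    deviations from the mean of the observed daily log file sizes."""
    mean = statistics.mean(daily_sizes)
    stdev = statistics.pstdev(daily_sizes)
    if stdev == 0:
        # No variation in history: any change at all is a deviation.
        return new_size != mean
    return abs(new_size - mean) > k * stdev

# Hypothetical sizes (in bytes) of an Apache access log over seven days.
history = [10_240, 11_002, 9_870, 10_500, 10_100, 10_730, 10_320]
print(detect_size_anomaly(history, 10_600))  # within threshold -> False
print(detect_size_anomaly(history, 52_000))  # far above threshold -> True
```

On a live server, the history would be collected by recording the sizes of the stack's log files (e.g. the Apache access log under /var/log) once per day.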
Log File Entry Classification
Also, we will analyze the log files and classify every log entry as either normal user activity or abnormal
user activity, whether the file is an access log file or an error log file. Based on that classification model,
we can then detect abnormal user activities.
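A minimal rule-based sketch of such a classifier is shown below. The signature patterns are illustrative assumptions only; the classification model could equally be a machine learning classifier trained on labelled log entries.

```python
import re

# Illustrative signatures of abnormal requests (SQL injection, XSS,
# directory traversal); a real deployment would need a fuller pattern
# list or a trained classification model.
ABNORMAL_PATTERNS = [
    re.compile(r"union\s+select", re.IGNORECASE),
    re.compile(r"<script", re.IGNORECASE),
    re.compile(r"\.\./"),
]

def classify_entry(log_line):
    """Label a single access or error log entry as normal or abnormal."""
    for pattern in ABNORMAL_PATTERNS:
        if pattern.search(log_line):
            return "abnormal"
    return "normal"

print(classify_entry("GET /index.php?id=1 HTTP/1.1"))                 # normal
print(classify_entry("GET /index.php?id=1 UNION SELECT * HTTP/1.1"))  # abnormal
```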
Markov Model of the Log File Sizes
We will also build a Markov model of the log file sizes using the data for each day. This will help us predict,
roughly, the expected log file size for the next day for the various software logs. Then, for each day, if the
expected file size is not observed, we can record it as an anomaly.
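One way to realize this technique is to discretize each day's log file growth into a few states and estimate first-order Markov transition probabilities from history; a transition whose estimated probability is very low (or zero, i.e. never observed) is then recorded as an anomaly. The state names, the bucketing, and the 0.1 probability threshold below are illustrative assumptions.

```python
from collections import defaultdict

def build_transition_model(states):
    """Estimate first-order Markov transition probabilities from a
    sequence of discretized daily log-size states."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(states, states[1:]):
        counts[prev][curr] += 1
    return {
        prev: {s: c / sum(nexts.values()) for s, c in nexts.items()}
        for prev, nexts in counts.items()
    }

def is_anomalous(model, prev_state, new_state, threshold=0.1):
    """Flag a transition whose estimated probability is below the
    threshold; transitions never seen in history have probability 0."""
    return model.get(prev_state, {}).get(new_state, 0.0) < threshold

# Daily log growth discretized into 'low', 'med' and 'high' buckets.
history = ["low", "low", "med", "low", "low", "med", "low", "low"]
model = build_transition_model(history)
print(is_anomalous(model, "low", "med"))   # frequent transition -> False
print(is_anomalous(model, "low", "high"))  # never observed -> True
```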
Conclusion
This research paper describes three techniques that will be employed to detect anomalies on a LAMP server
hosting a web application or ecommerce site. These three techniques are simple to understand
and relatively easy to implement.
References
1. What is Apache https://www.wpbeginner.com/glossary/apache/
2. Ubuntu Server Documentation
https://ubuntu.com/server/docs#:~:text=Ubuntu%20Server%20is%20a%20version,your%20datacentr
e%2C%20public%20or%20private.
3. Detection of Attack Targeted Scans From Apache HTTP Server Access Logs
https://www.sciencedirect.com/science/article/pii/S2210832717300169
4. What is MySQL and How Does it Work https://www.hostinger.com/tutorials/what-is-mysql
5. What is MySQL https://dev.mysql.com/doc/refman/8.0/en/what-is-mysql.html
6. What is PHP https://www.php.net/manual/en/intro-whatis.php
7. What is PHP? Learning All about the Scripting Language
https://www.hostinger.com/tutorials/what-is-php/
8. Classification of Malicious web Sessions
https://community.wvu.edu/~kagoseva/Papers/ICCCN-2012.pdf
9. Security Monitoring of HTTP Traffic Using Extended Flows
https://is.muni.cz/publication/1300438/http_security_monitoring-paper.pdf
10. Analyzing HTTP Requests for Web Intrusion Detection
https://www.semanticscholar.org/paper/Analyzing-HTTP-requests-for-web-intrusion-detection-Althub
iti-Yuan/f3adfc7e7686114ce2cb1a1eb7dc22848fdf13ca
11. Hakin9 Practical Protection Security Magazine
https://www.slideshare.net/RodrigoGomesPires/hakin9-05-2013?from_search=3