2. Who am I?
• Work as a Data Scientist
• Abundant log data
• Using algorithms to better serve our customers
— and their end users
#nginx #nginxconf
3. Exploration of our own Nginx logs
• We collect all sorts of log messages from our customers, with very high throughput
• We use both the syslog and HTTP protocols
• Nginx receives all our HTTP log messages and forwards them to our back-end
4. What do we mean by “anomalies”?
• By “anomalies” we mean events which are, in some way, unexpected or undesirable
• 4xx and 5xx responses are, in this broad sense, “anomalous”
• A first look, using tail and grep, revealed that we needed to focus on 4xx codes from certain requests
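The “first look, using tail and grep” can also be sketched in a few lines of Python. The log lines below are invented for illustration; a real run would read /var/log/nginx/access.log instead.

```python
import collections
import re

# Invented sample of access-log lines in combined-log format.
lines = [
    '1.2.3.4 - - [10/Sep/2015:12:00:01 +0000] "GET /p.gif?x=1 HTTP/1.1" 200 43 "-" "Mozilla"',
    '1.2.3.5 - - [10/Sep/2015:12:00:02 +0000] "GET /p.gif?x=2 HTTP/1.1" 408 0 "-" "Opera"',
    '1.2.3.6 - - [10/Sep/2015:12:00:03 +0000] "POST /log HTTP/1.1" 404 12 "-" "curl"',
]

# The status code sits between the closing quote and the bytes-sent field.
status_re = re.compile(r'" (\d{3}) ')

def count_4xx(log_lines):
    """Count 4xx responses per request method, ignoring everything else."""
    counts = collections.Counter()
    for line in log_lines:
        m = status_re.search(line)
        if m and m.group(1).startswith('4'):
            method = line.split('"')[1].split()[0]  # e.g. GET or POST
            counts[method] += 1
    return counts

print(count_4xx(lines))
```

This already surfaces the question the next slide asks: which requests produce the 4xx codes, and do they matter?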
5. Should we worry about 4xx errors?
• Our 4xx error rates are low and fairly stable, and we have no 5xx errors
• There is a common belief that 4xx errors require no further analysis
• For us, however, 4xx errors are significant:
• Most of our HTTP ingestion is via POST requests
• Yet we also obtain log data via GET requests with tracking pixels
• Used by some customers for their mobile end-users
6. Extracted “features”, or dimensions, from HTTP log data
• Payload size in bytes
• Country of origin
• OS
• Browser
• IP
• Referer host
• Date and time
One can get more than 100 features from Nginx log data alone (including headers)
7. In my sandbox, filter logs in real time and send them to a Kafka queue
tail -F /var/log/nginx/access.log | fgrep -v '" 200 ' | fgrep -v OPTIONS | fgrep gif | awk 'length($0) > 65 {print}' | ~/kafka_2.10-0.8.2.1/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic nginx_filtered_logs
Kafka works like a shock absorber, to avoid propagating bursts
8. A Python script reads from Kafka and parses logs, using standard libraries
import re, time
from kafka import KafkaConsumer   # kafka-python
from geoip import geolite2        # python-geoip-geolite2
import woothee                    # user-agent parser
from urlparse import urlparse     # Python 2 standard library

msg_regex = re.compile(
    r'(?P<ipaddress>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - '
    r'\[(?P<dateandtime>\d{2}/[a-zA-Z]{3}/\d{4}:\d{2}:\d{2}:\d{2} (\+|-)\d{4})\] '
    r'(("(?P<method>GET|POST|HEAD) )(?P<url>.+)(HTTP/1\.\d")*) '
    r'(?P<statuscode>\d{3}) (?P<bytessent>\d+) '
    r'(["](?P<referer>(-)|(.*))["]) (["](?P<useragent>.*)["])')

consumer = KafkaConsumer("nginx_filtered_logs", group_id='my_group',
                         bootstrap_servers=['localhost:9092'])
for message in consumer:
    time.sleep(0.2)
    msg = message.value
    m = msg_regex.match(msg)
    ip_match = geolite2.lookup(m.group('ipaddress'))  # may be None for unroutable IPs
    useragent_match = woothee.parse(m.group('useragent'))
    referer_match = urlparse(m.group('referer'))
    nginx_status_code = m.group('statuscode')
    payload = m.group('url')
    size = len(payload)
    os = useragent_match["os"]
    ipaddress = m.group('ipaddress')
    country = ip_match.country
Using freely available open source parsing libraries for this example
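To sanity-check the parsing step, the regex can be exercised on a single log line. The regex below is a simplified version of the slide's combined-log pattern (it omits referer and user agent), and the sample line is invented for illustration.

```python
import re

# Simplified combined-log regex; the full one on the slide also captures
# the referer and user-agent fields.
msg_regex = re.compile(
    r'(?P<ipaddress>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - '
    r'\[(?P<dateandtime>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) HTTP/1\.\d" '
    r'(?P<statuscode>\d{3}) (?P<bytessent>\d+)'
)

# Invented example line in Nginx combined-log format.
sample = ('203.0.113.9 - - [22/Sep/2015:09:15:32 +0000] '
          '"GET /pixel.gif?u=42 HTTP/1.1" 408 0 "-" "Opera/9.80"')

m = msg_regex.match(sample)
print(m.group('statuscode'), m.group('url'), len(m.group('url')))
```

Running the pattern against a handful of real lines before attaching the Kafka consumer catches most quoting and escaping mistakes early.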
9. For visualization, we sent the parsed features as messages to Loggly
Finding #1: “408” HTTP errors correlated with large GET payload size
Plot of payload size split by status code counts over time
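The comparison behind Finding #1 amounts to averaging payload size per status code. A minimal sketch, with invented numbers standing in for the parsed log stream:

```python
import collections

# Invented (statuscode, payload_size_in_bytes) pairs from parsed GET requests.
observations = [
    ('200', 120), ('200', 150), ('408', 4100),
    ('408', 3900), ('200', 90), ('408', 4500),
]

def mean_size_by_status(obs):
    """Average URL payload size for each status code."""
    sums = collections.defaultdict(lambda: [0, 0])  # status -> [total, count]
    for status, size in obs:
        sums[status][0] += size
        sums[status][1] += 1
    return {status: total / float(count)
            for status, (total, count) in sums.items()}

print(mean_size_by_status(observations))
```

A large gap between the 408 and 200 averages is the numerical version of what the plot shows.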
10. Finding #2: Most of our 4xx errors come from the Opera browser
Count of 4xx messages by browser type for given time interval
11. Finding #3: Opera browser users with 4xx errors mostly from Indonesia, South Africa, and South Asia (Bangladesh, India, Pakistan)
Countries of origin and status codes for Opera browser users
12. What if we could automate this exploration of anomalies?
• Start with the basics: 4xx and 5xx are anomalies
• These anomalies appear in clusters along with other dimensions available from HTTP data
• Some use cases call for correlating HTTP anomalies with application log message anomalies (exceptions and errors)…
• …in real time
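The manual exploration above (group the 4xx events by browser, then by country) can be automated as a simple counting step over the parsed features. The events below are invented; names and country codes are placeholders.

```python
import collections

# Hypothetical parsed events: (statuscode, browser, country) tuples like
# those produced by the parsing script.
events = [
    ('408', 'Opera', 'ID'), ('408', 'Opera', 'ZA'),
    ('404', 'Opera', 'ID'), ('200', 'Chrome', 'US'),
    ('408', 'Opera', 'ID'), ('200', 'Firefox', 'DE'),
]

def cluster_anomalies(evts):
    """Count 4xx/5xx events per (browser, country) pair, largest first."""
    counts = collections.Counter(
        (browser, country)
        for status, browser, country in evts
        if status[0] in '45'
    )
    return counts.most_common()

print(cluster_anomalies(events))
```

Real clustering would replace the hand-picked (browser, country) key with automated selection over many dimensions, but the shape of the computation is the same.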
13. A vision for growing log data analytical capabilities
• Tail | grep | cut | awk | sort | uniq of log data
• Monitoring and alerting of rates of HTTP errors
• Multi-dimensional analysis of HTTP log data (the example presented today)
• Advanced parsing and automated clustering
• Correlation of HTTP with other application data (exceptions, errors) in real time (customer prototype, TAFLUC)
14. The customer, a Content Management Platform, needed to identify unusual rates of 5xx and PHP errors in real time
• Discarded all 200s
• Only looking for anomalies “within the anomalies”:
• The rate of 5xx codes over all non-200 codes
• The rate of PHP fatal errors over all PHP-level messages
• Real-time analysis by the customer
• Our customer wants to notify their end-users as they deploy plug-ins
• Reduce the baseline rate of 5xx errors across the platform
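The “rate of 5xx codes over all non-200 codes” metric can be sketched over a window of status codes. The window below is invented, and the filter treats any 2xx code as a success, which matches “discarded all 200s” in spirit.

```python
def rate_5xx_over_non200(window):
    """Fraction of 5xx responses among all non-2xx responses in a window.

    Returns None when the window contains no non-2xx codes at all,
    so a quiet interval is not reported as a zero anomaly rate.
    """
    non200 = [code for code in window if not code.startswith('2')]
    if not non200:
        return None
    n5xx = sum(1 for code in non200 if code.startswith('5'))
    return n5xx / float(len(non200))

# Invented one-minute window of status codes.
window = ['200', '200', '404', '500', '200', '503', '404', '200']
print(rate_5xx_over_non200(window))
```

Alerting on this ratio, rather than on raw 5xx counts, is what isolates the “anomalies within the anomalies”: bursts of traffic raise both counts but leave the ratio stable.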
15. Conclusion
• Everyone is familiar with simple, one-dimensional analytics: 4xx and 5xx are bad
• This does not always show the full picture
• By expanding the set of features extracted from Nginx logs, it’s possible to identify more interesting patterns
• Good visualization is important, because humans are (today) better than computers at spotting such patterns
• Algorithms are also important, to automate the process and save time