2. Where do log files come from?
Use Case: Distributed Web Applications
● Web Servers
● Application Servers
● Databases
● Network infrastructure:
Routers / Load balancers / Switches
3. What do log files look like?
Example Web Server Log:
● Timestamp
● Source hostname / IP
● Session ID (if available)
● Request URL
● Return code
● Reply size
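The fields listed above can be pulled out of a log line with a regular expression. This is a minimal sketch, assuming the server writes the widely used Common Log Format; the sample line and the name `parse_clf_line` are made up for illustration. That format carries no session ID, so the field is omitted here.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [timestamp] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    parts = match.group("request").split()
    return {
        "host": match.group("host"),
        "timestamp": datetime.strptime(match.group("time"),
                                       "%d/%b/%Y:%H:%M:%S %z"),
        "url": parts[1] if len(parts) > 1 else None,
        "status": int(match.group("status")),
        # A "-" size means no body was sent; treat it as 0 bytes.
        "size": 0 if match.group("size") == "-" else int(match.group("size")),
    }

# Made-up sample line (hypothetical host and URL).
sample = ('host.example.com - - [01/Jul/1995:00:00:01 -0400] '
          '"GET /index.html HTTP/1.0" 200 1839')
record = parse_clf_line(sample)
print(record["host"], record["url"], record["status"], record["size"])
```

Lines that do not match return `None`, so a caller can count or skip malformed entries instead of crashing.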
4. Example Log File
NASA Dataset (http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html)
Two months' worth of HTTP requests to the Kennedy Space Center web server. (22 MB zipped)
5. What insights do Log Files provide?
● Monitoring
Are my servers up and running as expected?
How many resources are being used?
Are my business metrics ok? (earnings/h)
● Troubleshooting / Debugging
Why is my system slow?
Where are messages dropped?
● Reporting / Mining
How do my users behave?
Click stream analysis. KPI calculations (click through, bounce rate)
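As one example of a KPI calculation, the bounce rate can be derived from parsed log records. This is a rough sketch, assuming a bounce is a session with exactly one request and that each record already carries a session key (a session ID, or the client host as a fallback); the function name and sample data are hypothetical.

```python
from collections import Counter

def bounce_rate(requests):
    """requests: iterable of (session_key, url) pairs.

    Returns the fraction of sessions that made exactly one request.
    """
    requests_per_session = Counter(key for key, _url in requests)
    if not requests_per_session:
        return 0.0
    bounces = sum(1 for n in requests_per_session.values() if n == 1)
    return bounces / len(requests_per_session)

# Made-up click stream: sessions "a" and "c" view two pages, "b" bounces.
sample = [("a", "/"), ("a", "/news"), ("b", "/"), ("c", "/"), ("c", "/about")]
rate = bounce_rate(sample)
print(f"bounce rate: {rate:.2f}")
```

Real click stream analysis would also filter out requests for images, scripts, and crawlers before counting.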
6. Log Volume
Example Calculation for Wikipedia
● 200 application servers
● 20 database servers
● 70 cache servers
● 10k HTTP requests per second (rps) typical load (peak: 50k rps)
● 80k SQL queries per second peak load
Apache Access Log rate:
10k req/s * 100 B per log message = 1 MB/s (peak: 5 MB/s)
= 86 GB/day (i.e. BIG)
Source:
http://www.datacenterknowledge.com/archives/2008/06/24/a-look-inside-wikipedias-infrastructure/
http://reportcard.wmflabs.org/
https://ganglia.wikimedia.org/
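The access log arithmetic above can be reproduced in a few lines. This sketch uses decimal units (1 MB = 10^6 bytes) and treats the 100-byte message size as the slide's assumption, not a measured value.

```python
REQUESTS_PER_SECOND = 10_000        # sustained rate used in the calculation
PEAK_REQUESTS_PER_SECOND = 50_000   # peak rate
BYTES_PER_LOG_MESSAGE = 100         # assumed average log line size
SECONDS_PER_DAY = 24 * 60 * 60

mb_per_second = REQUESTS_PER_SECOND * BYTES_PER_LOG_MESSAGE / 1e6
peak_mb_per_second = PEAK_REQUESTS_PER_SECOND * BYTES_PER_LOG_MESSAGE / 1e6
gb_per_day = REQUESTS_PER_SECOND * BYTES_PER_LOG_MESSAGE * SECONDS_PER_DAY / 1e9

print(f"{mb_per_second} MB/s, peak {peak_mb_per_second} MB/s, "
      f"{gb_per_day:.1f} GB/day")
```

The daily figure comes out at 86.4 GB, which the slide rounds to 86 GB.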
7. Log File Processing
● Web Servers (several): business logic; generate log files.
● Log Aggregator: gathers log files from individual servers and stores them in a central location.
● Log Monitor: real-time reports, dashboards, plots, alerts.
● Log Analytics: batch processing, data mining.
Source: Theo Schlossnagle - Scalable Internet Architectures
8. Classical Solution
● Local storage of log files on web servers
● Periodic “pull” aggregation of log files, via FTP or SCP
Drawbacks:
● No real-time access to logs.
● No log files from crashed servers.
[Diagram: three Web Servers, Log Aggregator, Log Monitor, Log Analytics]
9. Real-Time Unicast
● Real-time aggregation of log files (“push”)
● Needs reliable transfer (classic syslog only provides UDP)
● Configuration management is complicated:
- every web server needs to know about the log aggregator
- problematic if redundancy is to be added
[Diagram: three Web Servers, Log Aggregator, Log Monitor, Log Analytics]
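The push model with reliable transfer can be sketched with a plain TCP connection from a web server to the aggregator. This is only an illustration under simplifying assumptions: the names `LogLineHandler` and `push_lines` are hypothetical, the aggregator handles a single connection on the loopback interface, and a real deployment would also need reconnects, buffering, and authentication.

```python
import socket
import socketserver
import threading

received = []  # stands in for the aggregator's central storage

class LogLineHandler(socketserver.StreamRequestHandler):
    """Aggregator side: read newline-delimited log lines until EOF."""
    def handle(self):
        for raw in self.rfile:
            received.append(raw.decode("utf-8").rstrip("\n"))

def push_lines(address, lines):
    """Web server side: push log lines to the aggregator over TCP."""
    with socket.create_connection(address) as conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))

# Port 0 lets the OS pick a free port; the server starts listening at once.
server = socketserver.TCPServer(("127.0.0.1", 0), LogLineHandler)
worker = threading.Thread(target=server.handle_request)  # one connection only
worker.start()

push_lines(server.server_address, ["GET /index.html 200", "GET /missing 404"])

worker.join()          # the handler returns once the sender closes the socket
server.server_close()
print(received)
```

Note that the sender must be given the aggregator's address, which is exactly the configuration burden the slide points out.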
10. Passive “Sniffing” Log aggregation
● Log files are produced by sniffing network packets
● Very accurate log files
● No interaction with or configuration of the servers required
● Needs a single egress point
● Security flaw (man-in-the-middle attack). Not compatible with SSL.
[Diagram: three Web Servers, Log Aggregator, Internet]
11. Best practice: Group communication
● Real-time log distribution using group messaging (ZeroMQ/Spread/Thrift)
● Flexible communication patterns (allow multiple subscribers)
● Use reliable IP multicast to reduce network load
● Less configuration overhead (group subscriptions)
[Diagram: three Web Servers, Log Aggregator, Log Analytics, Log Monitor(s)]
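The group subscription idea can be illustrated without any messaging library. The sketch below is an in-process stand-in, not ZeroMQ, Spread, or Thrift: the class `LogBus` and the group name `weblogs` are hypothetical, and no network or multicast is involved. It only shows that publishers address a group and never need to know who subscribes.

```python
from collections import defaultdict

class LogBus:
    """In-process stand-in for a group messaging layer."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, group, callback):
        self._subscribers[group].append(callback)

    def publish(self, group, line):
        # Every subscriber of the group gets its own copy of the line.
        for callback in self._subscribers[group]:
            callback(line)

archive = []        # plays the role of the log aggregator
error_alerts = []   # plays the role of a log monitor

def monitor(line):
    """Keep only lines whose last field is a 5xx status code."""
    if line.rsplit(" ", 1)[-1].startswith("5"):
        error_alerts.append(line)

bus = LogBus()
bus.subscribe("weblogs", archive.append)
bus.subscribe("weblogs", monitor)

bus.publish("weblogs", "web1 GET /index.html 200")
bus.publish("weblogs", "web2 GET /cart 500")
print(len(archive), error_alerts)
```

Adding another monitor is one more `subscribe` call; the publishing web servers are not reconfigured.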
14. Further Steps
● Chefkoch dataset (log files from several months)
1. Inspect data
2. Gather interesting questions about the data
3. Try to answer the questions using big data processing
● Stream processing vs. batch processing (Thomas)
- Which queries/operators can be answered on the stream?
- knowledge discovery / feature selection -> Indexing
Other log file sources
● Sensor data analysis
● Profiling of software projects (SOAMIG)