Cloud Application Logging for Forensics

Raffael Marty
Loggly, Inc.
78 First Street
San Francisco, CA 94105
firstname.lastname@example.org

SAC'11 March 21-25, 2011, TaiChung, Taiwan.
Copyright 2011 ACM 978-1-4503-0113-8/11/03 ...$10.00.

ABSTRACT
Logs are one of the most important pieces of analytical data in a cloud-based service infrastructure. At any point in time, service owners and operators need to understand the status of each infrastructure component for fault monitoring, to assess feature usage, and to monitor business processes. Application developers, as well as security personnel, need access to historic information for debugging and forensic investigations.
This paper discusses a logging framework and guidelines that provide a proactive approach to logging to ensure that the data needed for forensic investigations has been generated and collected. The standardized framework eliminates the need for logging stakeholders to reinvent their own standards. These guidelines make sure that critical information associated with cloud infrastructure and software as a service (SaaS) use-cases is collected as part of a defense in depth strategy. In addition, they ensure that log consumers can effectively and easily analyze, process, and correlate the emitted log records. The theoretical foundations are emphasized in the second part of the paper, which covers the implementation of the framework in an example SaaS offering running on a public cloud service.
While the framework is targeted towards and requires the buy-in of application developers, the data collected is critical to enable comprehensive forensic investigations. In addition, it helps IT architects and technical evaluators of logging architectures build a business-oriented logging framework.

Categories and Subject Descriptors
C.2.4 [Distributed Systems]: Distributed applications

General Terms
Log Forensics

Keywords
cloud, logging, computer forensics, software as a service, logging guidelines

1. INTRODUCTION
The "cloud" [5, 11] is increasingly used to deploy and run end-user services, also known as software as a service (SaaS). Running any application requires insight into each infrastructure layer for various technical, security, and business reasons. This section outlines some of these problems and use-cases that can benefit from log analysis and management. If we look at the software development life cycle, the use-cases surface in the following order:

• Debugging and Forensics
• Fault monitoring
• Troubleshooting
• Feature usage
• Performance monitoring
• Security / incident detection
• Regulatory and standards compliance

Each of these use-cases can leverage log analysis to either completely solve the use-case or at least drastically speed up and simplify its solution.
The rest of this paper is organized as follows: In Section 2 we discuss the challenges associated with logging in cloud-based application infrastructures. Section 3 shows how these challenges can be addressed by a logging architecture. In Section 4 we will see that a logging architecture alone is not enough; it needs to be accompanied by use-case driven logging guidelines. The second part of this paper (Section 5) covers a reference setup of a cloud-based application that shows how logging was architected and implemented throughout the infrastructure layers.

2. LOG ANALYSIS CHALLENGES
If log analysis is the solution to many of our needs in cloud application development and delivery, we need to take a closer look at the challenges that are associated with it. The following is a list of challenges associated with cloud-based log analysis and forensics:

• Decentralization of logs
• Volatility of logs
• Multiple tiers and layers
• Archival and retention
• Accessibility of logs
• Non existence of logs
• Absence of critical information in logs
• Non compatible / random log formats

A cloud-based application stores logs on multiple servers and in multiple log files. The volatile nature of these resources [1] causes log files to be available only for a certain period of time. Each layer in the cloud application stack generates logs: the network, the operating system, the applications, databases, network services, etc. Once logs are collected, they need to be kept around for a specific time, either for regulatory reasons or to support forensic investigations. We need to make the logs available to a number of constituencies: application developers, system administrators, security analysts, etc. They all need access, but only to a subset and not always all of the logs. Platform as a service (PaaS) providers often do not make logs available to their platform users at all. This can be a significant problem when trying to analyze application problems. For example, Amazon does not make the load balancer logs available to their users. And finally, critical components cannot be or are not instrumented correctly to generate the logs necessary to answer specific questions. Even if logs are available, they come in all kinds of different formats that are often hard to process and analyze.
The first five challenges can be solved through log management. The remaining three are more intrinsic problems and have to be addressed by defining logging guidelines and standards [2].

[1] For example, if machines are pegging at a very high load, new machines can be booted up, or machines are terminated without prior warning if they are not needed anymore.
[2] Note that in some cases it is not possible to change anything about the logging behavior, as we cannot control the code of third-party applications.

2.1 Log Management
Solving the cloud logging problems outlined in the last section requires a log management solution or architecture that supports the following list of features:

• Centralization of all logs
• Scalable log storage
• Fast data access and retrieval
• Support for any log format
• Running data analysis jobs (e.g., map reduce)
• Retention of log records
• Archival of old logs and restoring on demand
• Segregated data access through access control
• Preservation of log integrity
• Audit trail for access to logs

These requirements match up with the challenges defined in the last section. However, they do not address the last three challenges of missing and non-standardized log records.

2.2 Log Records
What happens if there are no common guidelines or standards defined for logging? In a lot of cases, application developers do not log much. Sometimes, when they do log, the log records are incomplete, as the following example illustrates:

Mar 16 08:09:58 kernel: [ 0.000000] Normal 1048576 -> 1048576

There is not much information in this record to determine what actually happened. What, for instance, is "Normal"?
A general rule of thumb states that a log record should be both understandable by a human and easily machine processable. This also means that every log entry should, if possible, record what happened, when it happened, who triggered the event, and why it happened. We will later discuss in more detail what these rules mean and how good logging guidelines cover these requirements (see Sections 4.2 and 4.3).
In the next section we will discuss how we need to instrument our infrastructure to collect all the logs. After that we will see how logging guidelines help address issues related to missing and incomplete log records.

3. LOGGING ARCHITECTURE
A log management system is the basis for enabling log analysis and reaching the goals introduced in the previous sections. Setting up a logging framework involves the following steps:

• Enable logging in each infrastructure and application component
• Set up and configure log transport
• Tune logging configurations

3.1 Enable Logging
As a first step, we need to enable logging on all infrastructure components that we need to collect logs from. Note that this might sound straightforward, but it is not always easy to do. Operating systems are mostly simple to configure. In the case of UNIX, syslog is generally already set up and logs can be found in /var/log. The hard part with OS logs is tuning. For example, how do you configure the logging of password changes on a UNIX system? Logging in databases is a level harder than logging in operating systems. Configuration can be very tricky and complicated. For example, Oracle has at least three different logging mechanisms, each with its own set of features, advantages, and disadvantages. It gets worse though: logging from within your applications is most likely non-existent. Or if it exists, it is likely not configured the way your use-cases demand; log records are likely missing, and the existing log records are missing critical information.

3.2 Log Transport
Setting up log transport covers issues related to how logs are transferred from the sources to a central log collector. Here are issues to consider when setting up the infrastructure:

• Synchronized clocks across components
• Reliable transport protocol
• Compressed communication to preserve bandwidth
• Encrypted transport to preserve confidentiality and integrity

3.3 Log Tuning
Log data is now centralized, and we have to tune the log sources to make sure we get the right types of logs and the right details collected. Each logging component needs to be visited and tuned based on the use-cases. Some things to think about are where to collect individual logs, which logs to store in the same place, and whether to collect certain log records at all. For example, if you are running an Apache Web server, do you collect all the log records in the same file: all the media file accesses, the errors, and the regular accesses? Or are you going to disregard some log records?
Depending on the use-cases, you might need to log additional details in the log records. For example, in Apache it is possible to log the processing time for each request. That way, it is possible to identify performance degradations by monitoring how long Apache takes to process a request.

4. LOGGING GUIDELINES
To address the challenges associated with the information in log records, we need to establish a set of guidelines, and we need to have our applications instrumented to follow these guidelines. These guidelines were developed based on existing logging standards and research conducted at a number of log management companies [4, 15, 20, 30, 33].

4.1 When to Log
When do applications generate log records? The decision when to write log records needs to be driven by use-cases. These use-cases in cloud applications surface in four areas:

• Business relevant logging
• Operations based logging
• Security (forensics) related logging
• Regulatory and standards mandates

As a rule of thumb, at every return call in an application, the status should be logged, whether success or failure. That way, errors are logged and activity throughout the application can be tracked.

4.1.1 Business
Business relevant logging covers features used and business metrics being tracked. Tracking features in a cloud application is extremely important for product management. It not only helps determine what features are currently used, it can also be used to make informed decisions about the future direction of the product. Other business metrics that you want to log in a cloud application are outlined in . Monitoring service level agreements (SLAs) falls under the topic of business relevant logging as well, although some of the metrics are more of operational origin, such as application latencies.

4.1.2 Operational
Operational logging should be implemented for the following instances:

• Errors are problems that impact a single application user and not the entire platform.
• Critical conditions are situations that impact all users of the application. They demand immediate attention [3].
• System and application start, stop, and restart. Each of these events could indicate a possible problem. There is always a reason why a machine stopped or was restarted.
• Changes to objects track problems and attribute changes to an activity. Objects are entities in the application, such as users, invoices, or goods. Other examples of changes that should be logged are:
  – Installation of a new application (generally logged on the operating system level).
  – Configuration changes [4].
  – Logging program code updates enables attribution of changes to developers.
  – Backup runs need to be logged to audit successful or failed backups.
  – Audit of log access (especially change attempts).

[3] Exceptions should be logged automatically through the application framework.
[4] For example, a reconfiguration of the logging setup is important to determine why specific log entries are not logged anymore.

4.1.3 Security
Security logging in cloud applications is concerned with authentication and authorization, as well as forensics support [5]. In addition to these three cases, security tools (e.g., intrusion detection or prevention systems or anti-virus tools) will log all kinds of other security-related issues, such as attempted attacks or the detection of a virus on a system. Cloud applications should focus on the following use-cases:

• Login / logout (local and remote)
• Password changes / authorization changes
• Failed resource access (denied authorization)
• All activity executed by a privileged account

Privileged accounts, admins, or root users are the ones that have control of a system or application. They have privileges to change most of the parameters in the application. It is therefore crucial for security purposes to monitor very closely what these accounts are doing [6].

[5] Note that any type of log can be important for forensics, not just security logs.
[6] Note also that this has an interesting effect on which user should be used on a daily basis. Normal activity should not be executed with a privileged account!

4.1.4 Compliance
Compliance and regulatory demands are one more group of use-cases that demand logging. The difference to the other use-cases is that complying with these regulations is often required by law or by business partners. For example, the payment card industry's data security standard (PCI DSS) demands a set of actions with regard to logging (see Section 10 of the PCI DSS). The interesting part about the PCI DSS is that it demands that someone reviews the logs, and not just that they are generated. Note that most of the regulations and standards cover use-cases that we discussed earlier in this section. For example, logging privileged activity is a central piece of any regulatory logging effort.

4.2 What to Log
We are now leaving the high-level use-cases and infrastructure setup to dive into the individual log records. How does an individual record have to look?
At a minimum, the following fields need to be present in every log record: timestamp, application, user, session ID, severity, reason, and categorization. These fields help answer the questions: when, what, who, and why. Furthermore, they are responsible for providing all the information demanded by our use-cases.
A timestamp is necessary to identify when a log record or the recorded event happened. Timestamps are logged in a standard format. The application field identifies the producer of the log entry. A user field is necessary to identify which user has triggered an activity. Use unique user names or IDs to distinguish users from each other. A session ID helps track a single request across different applications and tiers; the challenge is to share the same ID across all components. A severity is logged to filter logs based on their importance. A severity schema needs to be established, for example: debug, info, warn, error, and crit. The same schema should be used across all applications and tiers. A reason is often necessary to identify why something has happened. For example, access was denied due to insufficient privileges or a wrong password; the reason identifies why. As a last set of mandatory fields, category or taxonomy fields should be logged.
Categorization is a method commonly used to augment information in log records to allow addressing similar events in a common way. This is highly useful in, for example, reporting. Think about a report that shows all failed logins. One could try to build a really complicated search pattern that finds failed login events across all kinds of different applications [7], or one could use a common category field to address all those records.

[7] Each application logs failed logins in completely different formats.

4.3 How to log
What fields to log in a log record is the first piece of the logging puzzle. The next piece we need is a syntax. The basis for a syntax is rooted in normalization, which is the process of taking a log entry and identifying what each of its pieces represents. For example, to report on the top users accessing a system, we need to know which part of each log record represents the user name. The problems that can be solved with normalized data are also called structured data analysis. Sometimes additional processes are classified under normalization; for example, normalizing numerical values to fall into a predefined range can be seen as normalization. There are a number of problems with normalization, especially if the log entries are not self-descriptive.
The following is a syntax recommendation that is based on standards work and the study of many existing logging standards (e.g., [9, 17, 29, 35]). Common event expression (CEE) is a new standard that is going to be based on the following syntax:

time=2010-05-13 13:03:47.123231PDT,
session_id=08BaswoAAQgAADVDG3IAAAAD,
severity=ERROR,user=pixlcloud_zrlram,
object=customer,action=delete,status=failure,
reason=does not exist

There are a couple of important properties to note about the example log record:
First, each field is represented as a key-value pair. This is the most important property that every record follows. It makes the log files easy to parse for a consumer and helps with interpreting the log records. Also note that all of the field names are lower case and do not contain any punctuation. You could use underscores for legibility.
Second, the log record uses three fields to establish a categorization schema or taxonomy: object, action, and status. Each log record is assigned exactly one value for each of these fields. The category entries should be established upfront for a given environment. Standards like CEE are trying to standardize an overarching taxonomy that can be adopted by all log producers. Establishing a taxonomy needs to be based on the use-cases and should allow for classifying log records in an easy way. For example, by using the object value of 'customer', one can filter by customer-related events. Or, by querying for status=failure, it is possible to search for all failure-indicating log records across all of the infrastructure logs.
These are recommendations for how log entries should be structured. To define a complete syntax, issues like encoding have to be addressed. For the scope of this paper, those issues are left to a standard like CEE.

4.4 Don't forget the infrastructure
We talked a lot about application logging needs. However, do not forget the infrastructure. Infrastructure logs can provide very useful context for application logs. Some examples are firewalls, intrusion detection and prevention systems, or web content filter logs, which can help identify why an application request was not completed; any of these devices might have blocked or altered the request. Other examples are load balancers that rewrite or modify Web requests. Additionally, infrastructure issues such as high latencies might be related to overloaded machines, materializing in logs kept on the operating systems of the Web servers. There are many more infrastructure components that can be used to correlate against application behavior.

5. REFERENCE SETUP
In the first part of this paper, we have covered the theoretical and architectural basis for cloud application log management. The second part discusses how we have implemented an application logging infrastructure at a software as a service (SaaS) company. This section presents a number of tips and practical considerations that help the overall logging use-cases but, most importantly, support and enable forensic analysis in case of an incident.
Part of the SaaS architecture that we are using here is a traditional three-tiered setup that is deployed on top of the Amazon AWS cloud. The following is an overview of the application components and a list of activities that are logged in each component:
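Before turning to the individual components, the key=value record syntax recommended in Section 4.3 can be sketched in a few lines of code. The following Python helper is an illustrative sketch only; the function names, the quoting assumption (no commas or equal signs inside values, an encoding question the paper defers to a standard like CEE), and the sample values are ours, not part of the paper's implementation or of CEE:

```python
from datetime import datetime, timezone

def emit_record(session_id, severity, user, obj, action, status, reason):
    # Build a record with the mandatory fields from Section 4.2:
    # timestamp, session ID, severity, user, reason, and the
    # object/action/status taxonomy. Field names are lower case
    # with underscores, per the guidelines in Section 4.3.
    fields = [
        ("time", datetime.now(timezone.utc).isoformat()),
        ("session_id", session_id),
        ("severity", severity),
        ("user", user),
        ("object", obj),
        ("action", action),
        ("status", status),
        ("reason", reason),
    ]
    return ",".join(f"{key}={value}" for key, value in fields)

def parse_record(record):
    # Split a key=value record back into a dict. Assumes values
    # contain no commas or equal signs.
    return dict(pair.split("=", 1) for pair in record.split(","))

# Querying for status=failure, as described in Section 4.3:
records = [
    emit_record("08Ba", "ERROR", "pixlcloud_zrlram",
                "customer", "delete", "failure", "does not exist"),
    emit_record("08Bb", "INFO", "pixlcloud_zrlram",
                "customer", "create", "success", "ok"),
]
failures = [r for r in records if parse_record(r)["status"] == "failure"]
```

In a multi-tier setup like the one described below, the session_id passed into such a helper would be the ID generated for the original Web request, so that Web, database, and backend records can be correlated.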
configuration. The modules replace the remote IP (%h) field with the value of the X-Forwarded-For header in the HTTP request if the request came from a load balancer. We ended up using mod_rpaf with the following configuration:

LoadModule rpaf_module mod_rpaf.so
RPAFenable On
RPAFsethostname On
RPAFproxy_ips 10.162.65.208 10.160.137.226

This worked really well until we realized that on Amazon AWS, the load balancer constantly changes. After defining a very long chain of RPAFproxy_ips, we started looking into another solution, which was to patch mod_rpaf to work as desired.

5.4 MySQL
Our MySQL setup is such that we are using Amazon's Relational Database Service. The problem with this approach is that we do not get MySQL logs. We are looking into configuring MySQL to send logs to a separate database table and then exporting the information from there. We haven't done so yet.
One of the challenges with setting up MySQL logging will be to include the common session ID in the log messages to enable correlation of the database logs with application logs and Web requests.

5.5 Operating system
Our infrastructure heavily relies on servers handling requests and processing customer data in the back end. We are using collectd on each machine to monitor individual metrics. The data from all machines is centrally collected, and a number of alerts and scripted actions are triggered based on predefined thresholds.
We are also utilizing a number of other log files on the operating systems for monitoring. For example, some of our servers have very complicated firewall rules deployed. By logging failed requests, we can identify both misconfigurations and potential attacks. We are able to identify users that are trying to access their services on the platform but are using the wrong ports to do so. We mine the logs and then alert the users of their misconfigurations. To assess the attacks, we are not just logging blocked connections, but also selected passed ones. We can, for example, monitor our servers for strange outbound requests that haven't been seen before. Correlating those with observed blocked connections gives us clues as to the origin of these new outbound connections. We can then use the combined information to determine whether the connections are benign or should be of concern.

5.6 Backend
We are operating a number of components in our SaaS's backend. The most important piece is a Java-based server component we instrumented with logging code to monitor a number of metrics and events. First and foremost, we are looking for errors. We instrumented Log4J to log through syslog. The severity levels defined earlier in this paper are used to classify the log entries.
In order to correlate the backend logs with any of the other logs, we are using the session ID from the original Web request and pass it down in any query made to the back end as part of the request. That way, all the backend logs contain the same session ID as the original request that triggered the backend process.
Another measure we are tracking very closely is the number of exceptions in the logs. In the first instance, we monitor them in order to fix them. However, not all of them can be fixed or make sense to fix. What we are doing, however, is monitoring the number of exceptions over time. An increase shows that our code quality is degrading, and we can take action to prioritize work on fixing them. This has shown to be a good measure to balance feature- vs. bug-related work, and it often proves invaluable when investigating security issues.

6. FUTURE TOPICS
There are a number of issues and topics around cloud-based application logging that we haven't been able to discuss in this paper but are interested in addressing at a future point in time: security visualization, forensic timeline analysis, log review, log correlation, and policy monitoring.

7. CONTRIBUTIONS
To date, there exists no simple and easy to implement framework for application logging. This paper presents a basis for application developers to guide the implementation of efficient and use-case oriented logging. The paper contributes guidelines for when to log, where to log, and exactly what to log in order to enable the three main log analysis uses: forensic investigation, reporting, and correlation. Without following these guidelines, it is impossible to forensically recreate the precise actions that an actor undertook. Log collection and logging guidelines are an essential building block of any forensic process.
The logging guidelines in this paper are tailored to today's infrastructures that are often running in cloud environments, where asynchronous operations and a variety of different components are involved in single user interactions. They are designed to help investigators, application developers, and operations teams be more efficient and better support their business processes.

8. BIOGRAPHY
Raffael Marty is an expert and author in the areas of data analysis and visualization. His interests span anything related to information security, big data analysis, and information visualization. Previously, he has held various positions in the SIEM and log management space at companies such as Splunk, ArcSight, IBM Research, and PriceWaterhouseCoopers. Nowadays, he is frequently consulted as an industry expert on all aspects of log analysis and data visualization. As the founder of Loggly, a logging as a service company, Raffy spends a lot of time re-inventing the logging space and, when not surfing the California waves, he can be found teaching classes and giving lectures at security conferences around the world.