Importance Of Structured Incident Response Process
Anton Chuvakin, Ph.D., GCIA, GCIH, GCFA
SANS Six Step Incident Response Methodology
Incident Response Tools
Example Corporation – Worm Incident Revisited
Common Mistakes of Incident Response
Security is a rapidly changing field of human endeavor. Threats we face literally
change every day; moreover, many security professionals consider the rate of
change to be accelerating. On top of that, to be able to stay in touch with such an
ever-changing reality, one has to evolve with the space as well. Thus, even
though I hope that this document will be useful to my readers, please keep in
mind that it was possibly written years ago. Also, keep in mind that some of the
URLs might have gone 404; please Google around.
Right around lunchtime, a helpdesk operator at Example Corporation -- a
medium-sized manufacturing company -- receives calls from several users all
reporting computer failures and slow network response. Example Corporation’s
security infrastructure includes firewalls, intrusion detection systems, anti-virus
software and operating system logs, all technology investments from the “boom”
years. The helpdesk operator opens a new trouble ticket in Remedy, describing
the users’ problems and recording the machines’ hostnames. Other unrelated
support issues continue to pile up and the operator’s attention is directed
Meanwhile, the worm, which caused the above laptop problems, continues to
spread throughout Example’s network. The malicious software made its way into
Example after being brought in by one of the sales people who often plugs his
laptop into untrusted networks, such as hotels and customer environments,
outside the company. With most of Example’s security monitoring capabilities
deployed in a DMZ and on a network perimeter, the remainder of Example’s
vulnerable corporate assets are largely unguarded and unwatched. Thus, as the
worm wends its way around Example’s enterprise, the company security team is
not even aware of a developing disaster.
Soon, network traffic generated by the worm has increased dramatically, as more
machines become infected and start spewing copies of the same worm. When
the infection reaches critical levels and starts to affect the performance of
monitored servers, the security team is notified by a flood of pager alerts… chaos
ensues. While some try installing anti-virus updates, others apply firewall blocks
(preventing not only worm scanning, but also the download of updates) and yet
others try to scan for vulnerable machines, which contributes to the network-level
After hours of uncoordinated activities, most of the worm-carrying machines are
discovered and the re-infection rate is brought under control. A
management-requested investigation begins and computer forensic consultants are brought in.
However, what remained of the initial infection evidence was either destroyed or
extremely hard to find due to “mitigation” activities that were implemented. No
one remembered the original Remedy incident recorded by the helpdesk
operator since the helpdesk system was not deemed relevant for security
information. The investigation was able to conclude only that the malicious
software was brought in from outside the company -- the specific initial infection
vector was never determined.
The financial and technological damage is easy to see. And yet, the recurring
security incident described above shows what happens when companies lack a
central point from which to manage security incidents.
Security professionals learn to constantly chant the mantra “prevention-detection-
response.” Each of these three components is known to be of crucial importance
to the organization’s security posture. However, unlike detection and prevention,
the response is impossible to avoid. While it is not uncommon for
organizations to have weak prevention and nearly non-existent detection
capabilities, response will have to be there since the organization will often be
forced into response mode by the attackers (be it the internal abuser,
omnipresent “script kiddie” or the elusive “uber-hacker”) or their evil creations
(viruses, worms and spyware). The organization will likely be made to respond in
some way after the incident has taken place. Even in cases where ignoring the
incident that happened might be the chosen option, the organization will implicitly
follow a response plan, even one as ineffective as doing nothing.
In light of this, being prepared for incident response is likely to be one of the most
cost-effective security measures the organization takes. Timely and effective
incident response is directly related to decreasing the incident-induced loss to the
organization. It can also help to prevent expensive and hard-to-repair
reputation damage, which often follows a security incident. Several
industry surveys have found that a public company's stock price may plunge
several percent as a result of a publicly disclosed incident
(http://www.securityfocus.com/news/11197). Incidents that are known to wreak
catastrophic results upon the organizations may involve malicious hacking, virus
outbreaks, economic espionage, intellectual property theft, network access
abuse, theft of IT resources and other policy violations.
Most of us in the security industry are already familiar with the traditional
challenges we face every day… too much security data to sift through, too many
false alarms to deal with, and not enough budget or resources to handle an
ever-growing number of security incidents. One additional and often overlooked
challenge involves the security management process itself. Largely ignored in
many of today’s IT enterprises, a clearly defined, documented, and repeatable
incident management process defined in an incident response plan is
fundamental to ensuring fast and accurate handling of security incidents.
Even if an explicit incident response plan is lacking, after an incident occurs
company management might ask questions such as these:
• What do we do now?
• How do we put it back the way it was?
• How do we prevent recurrence?
• How should we have prepared?
• Should we try to figure out who is responsible?
Answering these questions requires knowledge of your computing environment,
company culture and internal procedures, implemented technical security and
policy countermeasures. Effective incident response fuses together technical and
non-technical resources, bound by the incident response policy, procedures and
plans. Such policy should be continuously refined and improved, based on the
organization's incident history, just as the main security policy should be.
To build an initial incident resolution management framework one can use the SANS
Six Step incident response methodology. This approach was originally developed
for the US Department of Energy, adopted elsewhere in the US government and
then popularized by the SANS Institute.
The methodology includes the following six steps: Preparation, Identification,
Containment, Eradication, Recovery and Follow-Up.
SANS Six Step Incident Response Methodology
Overall, the SANS methodology allows an organization to give structure to the
otherwise chaotic incident response workflow. The steps of the SANS
methodology are both clearly defined and easy to follow, and most importantly,
work in the high-stress post-incident environments for which they were designed.
Following the steps is as easy as selecting and appropriately customizing the
procedures for each case at hand. Using the SANS pre-defined procedures
assures that an incident response workflow will become relatively painless and
the crucial steps will not be missed. Additionally, such a system will facilitate
both training and collaboration between various response team members, who
can share the workload for increased efficiency.
Finally, integrating the SANS methodology into overall incident response
planning assures today’s IT organizations that they have a comprehensive
approach in place to tackle security incidents. It also demonstrates compliance
with industry “best practices”, which is sometimes associated with regulatory
compliance. Having a repeatable incident management process is highlighted in
several recent regulations, such as HIPAA.
Let’s spend just a moment reviewing a few key features of the SANS Six Step
Incident Response methodology:
The Preparation stage covers everything one should do before handling the first
incident. It involves both technology issues, such as preparing response and
forensics tools, learning the environment, configuring systems for optimal
response and monitoring, as well as business issues -- such as assigning
responsibility, forming a team and establishing escalation procedures.
Additionally, this stage covers the steps necessary to increase a company’s
security posture and thus decrease the likelihood and damage from future
incidents. Security audits, patch management, employee security awareness
programs and other security tasks all serve to prepare the organization for incident
action. Building a culture of security and a secure computing environment also
serves as incident preparation.
Specifically, establishing a real-time system and network security event
monitoring program will help the organization receive early warnings of hostile activity
as well as collect evidence after an incident. Providing a single view into your
security infrastructure goes a long way towards being better prepared and
equipped to deal with incidents as they occur, as well as to clean up in the
aftermath. A single evidence store allows sophisticated data
analysis, leading to better awareness of threats and vulnerabilities.
Identification is what happens first when an incident is suspected or detected.
Determining whether the observed event does in fact constitute an incident (as
defined above) is of crucial importance. Careful record keeping is very important,
since such documentation will be heavily used at later stages of the response
process. One should record everything that was observed in relation to the
incident, whether online or in the physical environment. During this stage, it is
important that the people responsible for incident handling maintain a proper chain
of custody, described at http://en.wikipedia.org/wiki/Chain_of_custody as a
“document or paper trail showing the seizure, custody, control, transfer, analysis,
and disposition of physical and electronic evidence.” Contrary to popular
opinion, this is important even when the case is never destined to end up in
court. Following established and approved procedures will also help an investigation
that is internal to the company.
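As a rough illustration of the record keeping described above, the following minimal Python sketch logs each custody action together with a SHA-256 digest of the evidence, so that a later change in the evidence becomes visible. The class names, field names and action labels are hypothetical and chosen only for this example; a real procedure would follow the organization's approved forms.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the evidence bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class CustodyEntry:
    # One line of the paper trail (hypothetical layout).
    timestamp: datetime
    handler: str   # who held the evidence
    action: str    # e.g. "seized", "transferred", "analyzed"
    digest: str    # digest of the evidence at the time of the action


@dataclass
class EvidenceItem:
    description: str
    trail: list = field(default_factory=list)

    def record(self, handler: str, action: str, data: bytes) -> None:
        """Append a custody entry for the evidence in its current state."""
        entry = CustodyEntry(
            datetime.now(timezone.utc), handler, action, sha256_of(data)
        )
        self.trail.append(entry)

    def is_intact(self) -> bool:
        """True if the digest never changed across the recorded trail."""
        return len({entry.digest for entry in self.trail}) <= 1
```

Calling `record()` at every seizure, transfer and analysis step produces the trail; `is_intact()` only shows that the recorded digests agree and does not by itself prove how the evidence was handled between entries.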
Various security technologies play a role in incident identification. For example,
firewall, IDS, server and application logs reveal evidence of potentially hostile
activities, coming from both outside and inside the protected perimeter. Logs are
often paramount in finding the party responsible for those activities. Security
event correlation is essential for high quality incident identification, due to its
ability to uncover patterns in incoming security event flow. Collecting various
audit logs and correlating them in near real-time goes a long way towards making
the identification step of the response process less laborious. Additionally,
incident identification is greatly helped by “qualifying” the IDS and other alerts
using other environment context, such as system and application vulnerabilities,
running applications as well as business value.
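The idea of "qualifying" an alert with environment context can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the `Asset` and `Alert` records, the 1-to-5 business value scale and the scoring weights are all hypothetical, not part of any particular IDS or SIM product.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    # Hypothetical context record for a protected system.
    hostname: str
    vulnerabilities: set    # identifiers of known unpatched vulnerabilities
    running_services: set   # e.g. {"http", "ssh"}
    business_value: int     # assumed scale: 1 (low) to 5 (critical)


@dataclass
class Alert:
    target: str         # hostname of the attacked asset
    exploits_vuln: str  # vulnerability the attack relies on
    service: str        # service the attack is aimed at


def qualify(alert: Alert, assets: dict) -> int:
    """Return a priority score; 0 means the alert can likely wait."""
    asset = assets.get(alert.target)
    if asset is None:
        return 1  # unknown asset: keep a minimal priority for manual review
    if alert.service not in asset.running_services:
        return 0  # the targeted service is not running on this asset
    score = asset.business_value
    if alert.exploits_vuln in asset.vulnerabilities:
        score *= 2  # the asset is actually vulnerable to this attack
    return score
```

With this weighting, an exploit aimed at a vulnerable, business-critical server rises to the top of the queue, while an attack on a service that is not running drops out.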
Containment is what keeps the incident from spreading and thus incurring
higher financial or other loss. During this stage, the incident responders will
intervene and attempt to limit the damage, such as by tightening network or host
access controls, changing system passwords, disabling accounts, etc. While
completing the above steps, one should make every effort to keep all the
potential evidence intact, balancing the needs of system owners and incident
investigators. The backup of affected systems is also essential at this step. This
is done to preserve the system for further investigation as well as remediation.
The important decision on whether to continue operating the affected assets
should be made by the appropriate authorities during this stage.
Automated containment measures, such as firewall blocking, system
reconfiguration or forced file integrity checks, and the use of intrusion prevention
solutions (in inline mode) can also be used, if driven by event correlation and
more intelligent analytics. However, automated containment will likely become
widely accepted in the future.
Eradication is the only stage when the factors leading to the incident are
eliminated or mitigated. Such factors often include system vulnerabilities, unsafe
system configurations, out-of-date protection software or even imperfect physical
access control. Also, the non-technology controls such as building access
policies or key card privileges might be adjusted at this stage. In the case of a
hacker-related incident, the affected systems are likely to be restored from the
last clean backup or rebuilt from the operating system vendor media with all
Time is most critical during the eradication stage. The first response should
satisfy several often conflicting criteria, such as accommodating the system
owners’ requests, preserving evidence and stopping the spread of damage, while
complying with all of the organization's applicable policies.
Recovery is the stage where the organization's operations return to normal.
Systems are restored and configured to prevent recurrence and are returned to
regular use. To ensure that the newly established controls are working, the
organization might want to maintain increased monitoring of the affected assets
for some period of time.
Return to production is always a critical step. If done too early, there is a
significant risk of recurrence; if done too late, it risks upsetting the business
owners. Thus, it should be clearly documented in the incident procedures during
the preparation stage.
Follow-Up is an extremely important stage of the incident response process.
Just as with the preparation stage above, proper incident follow-up helps to ensure
that lessons are learned from the incident and that the overall security posture
improves as a result. Follow-up is also important in order to prevent the
recurrence of similar incidents. In addition, a report on the incident is often
submitted to senior management. It covers the actions taken, summarizes
the lessons learned and also serves as a knowledge repository in case of similar
incidents in the future.
Follow-up steps often need to be distributed to a wider audience than the rest of
the investigation process. An enterprise-wide security knowledge base helps to
address this challenge. It will ensure that IT resource owners are better
prepared to combat future threats. To optimize the distribution of incident
information, one can use various forms and templates, prepared in advance for
different types of incidents. Properly sanitized past incident cases should also be
added to the organization-wide security knowledge base, in addition to
industry security resources and vulnerability knowledge. Such materials can later
be used for training new incident responders as well as a broader IT audience. A
summary of suggested actions might also be sent to the senior management.
Incident Response Tools
While people and processes are important, tools are what complete the security
triangle. When an incident is suspected, the response team will need tools to
verify its status, assess the damage already incurred as well as the damage that may still occur,
and then proceed to contain and recover from the incident. This involves a wide
range of tools from intrusion detection to forensics and vulnerability
management. Backup tools should also not be overlooked. Tools helpful for
incident management can be organized as such:
Tools                        Common uses during incident response

Evidence collection          System and security logs, audit trails, disk
                             and storage images, email and other communication
Data analysis and forensics  Correlation, searching and reporting,
                             forensics discovery activities
Collaboration                Incident team communication, workflow,
Backup                       Evidence preservation, “known good”
                             configuration retention, user data recovery
Documentation                Actions logged for audit and improvement,
                             reporting, incident team performance
                             measurement, lessons learned, future team
Some tools are helpful in more than one of the above categories. For example, a
Security Information Management (SIM) solution often holds most of the
evidence from the scene of the information security incident. Incident handling is
a natural SIM product functionality aimed at gathering and organizing security
event data around incidents and also enforcing proper response workflow in
order to facilitate effective and prompt response to security incidents.
Specifically, a SIM:
• Facilitates the effective handling process
• Integrates evidence storage and analysis
• Enforces proper access control to evidence
• Enables team collaboration
• Simplifies resolution monitoring and reporting
• Makes security measurable
In general, it establishes a single control point of the security response
capabilities by combining the major potential evidence storage with the
Other tools that an incident team needs to be very familiar with include disk
image forensics tools, covering the whole lifecycle from making a forensics copy
of the suspect’s workstation to final evidence presentation to an internal authority
or law enforcement. Those tools do require significant training, especially if used
for cases where court trial is likely.
Example Corporation – Worm Incident Revisited
A network helpdesk operator receives calls from several users – all reporting
computer failures and slow network response. Using a newly established
process, a trained team and the right tools, an incident case is opened according to
the plan and user complaints from that department are summarized and
presented to all relevant parties, including the security team contact. The affected
machines together with the information on their owners are also added to
corresponding case fields. The operator then assigns the case to the security
event monitoring team, as mandated by his instructions, derived from the incident
Upon receiving the assignment through the case management system, a
monitoring team member runs several queries searching for suspicious events to
and from the affected machines – all as part of the incident identification
procedure defined by the company. He discovers that a network IDS has
detected an email worm being transmitted from outside the environment. The
monitoring team member shares the incident case with the security analyst team,
running the intrusion detection, so they can verify the impact of the IDS events,
based on the affected assets' business role and importance. Many events
reported by the anti-virus systems running on some of the users’ desktops were
also reported from the affected IP addresses. As a next step, an analyst selects a
Containment procedure from the knowledge base, which involves quarantining
the infected machines by applying a firewall rule to prevent the spread of the
worm. The procedure is added to the incident case and then implemented.
Next, it is necessary to clean the infected PCs. The Mitigation procedure
involves installing and running a full scan using a freshly updated copy of anti-virus
software. The security engineering team together with the security analyst team
verifies compliance of the newly installed anti-virus system with the company's
The recommended Follow-up procedure includes a mandated company-wide
desktop anti-virus deployment from a dedicated server. The procedure is then
submitted for management approval and, once approved, the remediation team
ensures that the anti-virus software is pushed out to all company desktop PCs
and the incident case is closed.
Here is another example of how a company with a well-tuned incident response
process handles an attack against the web server.
A security analyst on duty received an email notification when a correlated event
indicating a successful attack was triggered by the SIM solution. The analyst discovered
that a real-time correlation rule was matched by a series of events directed
against the auxiliary web server.
By logging into the SIM and running a report, the analyst found out that the
triggered rule aims to detect high-severity attacks against the web server that
are preceded by reconnaissance activity, such as a server version query. The
web server was first probed for its type and version and later attacked with a
known exploit detected by the network intrusion detection system. The company
security monitoring procedure mandated that such events be investigated.
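A correlation rule of the kind just described (reconnaissance followed by an exploit from the same source against the same server) might look roughly like the Python sketch below. The `Event` fields, the "recon" and "exploit" labels and the one-hour window are assumptions made for illustration; an actual SIM product would express the rule in its own rule language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: float   # seconds since an arbitrary start
    source: str   # address the activity came from
    target: str   # server the activity was aimed at
    kind: str     # "recon" or "exploit" (hypothetical labels)


def correlate(events, window=3600.0):
    """Return (recon, exploit) pairs where an exploit follows a probe
    from the same source against the same target within the window."""
    last_recon = {}
    matches = []
    for event in sorted(events, key=lambda e: e.time):
        key = (event.source, event.target)
        if event.kind == "recon":
            last_recon[key] = event  # remember the most recent probe
        elif event.kind == "exploit":
            recon = last_recon.get(key)
            if recon is not None and event.time - recon.time <= window:
                matches.append((recon, event))
    return matches
```

An exploit attempt with no preceding probe from the same source is deliberately not matched here; it would still appear as an individual IDS alert, just not as a correlated event.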
Thus, the analyst clicked on the correlated event in the corresponding report and
chose to add it to a new incident case. He then added a note saying that he
received an email notification and started the investigation in accordance with the
After the case was registered by the system, the analyst proceeded to investigate
the related events. He opened the report to view the raw security events that
triggered the correlation. Such events included probes against multiple servers
followed by an attack. He looked at the attack details and found out that the IDS
signature for the exploit matched the server type and the operating system. He
added all the related events to the incident case as well.
Further, he ran a query to look for more traces of the attacker’s IP
address (the source) in the event database. Multiple entries indicative of
scanning, denied connections on the firewall and TCP port 80 attempts across
the enterprise were discovered. The report results were also added to the
At that stage it was obvious that a consistent attack was in progress. A note
was added to the case Identification section saying that the incident was confirmed
and several servers might have been impacted.
The analyst then searched all events involving the attacked web server. No
suspicious activity had originated from it. However, since the server was not a
business critical asset, it was possible to take it offline for investigation. This
decision was recorded in the Containment section of the incident case and the
server was taken offline.
The detailed server investigation that followed did not reveal any signs of a
successful compromise. However, the server logs contained evidence of
multiple failed exploit attempts. The server was also found to be missing several critical
patches. Their absence was apparently not detected by the attacker. It was decided
to patch the server before the regular maintenance window and to return it
online. It was also decided to increase the logging level on the server. The
respective note was made in the Mitigation section of the incident case and the
above steps were performed.
After the server was returned to operation, the analyst assigned the case to
the incident manager who had the authority to review the performed steps and to
close the case. The manager added several notes to the follow-up section, which
suggested that servers in that subnet be scanned for vulnerabilities more often.
The case was then closed.
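The case workflow in this example can be pictured as a small data structure: a case with one set of notes per section, a list of attached events, and a closing step restricted to an incident manager. The Python below is a minimal sketch of that idea. The section names follow the example, while the class, method names and the manager check are hypothetical.

```python
from dataclasses import dataclass, field

# Section names as used in the example above.
SECTIONS = ("Identification", "Containment", "Mitigation", "Follow-up")


@dataclass
class IncidentCase:
    title: str
    assignee: str
    notes: dict = field(default_factory=lambda: {s: [] for s in SECTIONS})
    events: list = field(default_factory=list)
    closed: bool = False

    def add_note(self, section: str, text: str) -> None:
        """Record a note in one of the case sections."""
        if self.closed:
            raise ValueError("case is closed")
        if section not in self.notes:
            raise KeyError(section)
        self.notes[section].append(text)

    def attach_event(self, event) -> None:
        """Attach a raw or correlated security event to the case."""
        if self.closed:
            raise ValueError("case is closed")
        self.events.append(event)

    def close(self, by: str, managers: set) -> None:
        """Close the case; only an incident manager may do so."""
        if by not in managers:
            raise PermissionError(f"{by} may not close cases")
        self.closed = True
```

Keeping the notes keyed by stage makes it easy to see afterwards which decisions were taken at which step, which is what the follow-up review relies on.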
Common Mistakes of Incident Response
While many organizations are on the path towards organizing their incident
response, many pitfalls lie in wait for them on the path to incident management
nirvana. This section summarizes several mistakes that companies make in their
security incident response.
#1. Not having a plan
The first mistake is simply not creating an incident response plan before incidents
start happening. Having a plan in place (even a plan that is not well thought out)
makes a world of difference! Such a plan should cover all the stages of the incident
response process, from preparing the infrastructure to first response all the way to
learning the lessons of a successfully resolved incident.
If you have a plan, then after the initial panic phase ('Oh, my, we are being
hacked!!!'), you can quickly move into a set of planned activities, including a
chance to contain the damage and curb the incident losses. Having a checklist to
follow and a roster of people to call is of paramount importance in a stressful
To jump-start the planning activity one can use a ready-made methodology, such
as the SANS Institute six-step incident response process, covered above. With a plan
and a methodology your team will soon be battle-hardened and ready to respond
to the next virus faster and more efficiently. As a result, you might manage to
contain the damage to your organization.
#2. Failing to increase monitoring and surveillance
The second mistake is not deploying increased monitoring and surveillance after
an incident has occurred. This is akin to shooting yourself in the foot during the
incident response. Even though some companies cannot afford 24/7 security
monitoring, there is no excuse for not increasing monitoring after an incident has
At the very least, one of the first things to do after an incident is to crank up all
the logging, auditing and monitoring capabilities in the affected network and
systems. This simple act has the potential to make or break the investigation by
providing crucial evidence for identifying the cause of the incident and resolving
it. It often happens that later in the response process, the investigators discover
that some critical piece of log file was rotated away or an existing monitoring
feature was forgotten in an 'off' state. Having plenty of data on what was going
on in your IT environment right after the incident will not just make the
investigation easier, it will likely make it successful.
Another side benefit is that increased logging and monitoring will allow the
investigators to confirm that they indeed have followed the established chain of
#3. Being unprepared for a court battle
The third mistake is often talked about, but rarely avoided. Some experts have
proclaimed that every security incident needs to be investigated as if it will end
up in court. In other words, maintaining forensic quality and following the
established chain of custody needs to be assured during the investigation.
Even if the case looks as if it will not go beyond the suspect's manager or the
human resources department (in the case of an internal offense) or even the
security team itself (in many external hacking and virus incidents), there is
always a chance that it will end up in court. Cases have gone to court after new
evidence was discovered during an investigation and what was thought to be a
simple issue of inappropriate Web access became a criminal child pornography
Moreover, while you might not be expecting a legal challenge, the suspect might
sue in retaliation for a disciplinary action against him or her. A seasoned incident
investigator should always consider this possibility.
In addition, following a high standard of investigative quality always helps since
the evidence will be that much more reliable and compelling, if it can be backed
up by a thorough and well-documented procedure.
#4. Putting it back the way it was
The fourth mistake is reducing your incident response to "putting it back the way
it was". This often happens if the company is under deadline to restore the
functionality. While this motive is understandable, there is a distinct possibility
that failing to find out why the incident occurred will lead to repeat incidents, on
the same or different systems.
For example, in the case of a hacking incident, if an unpatched machine that
was compromised is rebuilt from the original OS media, but the exploited
vulnerability is not removed, the hackers are very likely to come back and take it
over again. Moreover, the same fate will likely befall other exposed systems.
Thus, while returning to operation might be the primary goal, don’t lose sight of
the secondary goal: figuring out what happened and how to prevent it from
happening again. It feels bad to be on the receiving end of a successful attack,
but it feels much worse to be hit twice by the same threat and have your defenses
fail in both cases.
Incident response should not be viewed as a type of "firefighting" although you’d
fight plenty of fires in the process. It can clearly help in case of a fire, but it can
also help prevent fires in the future.
#5. Not learning from mistakes
The final mistake sounds simple, but it is all too common. It is simply not learning
from mistakes! Creating a great plan for incident response and following it will
take the organization a long way toward securing the company, but what is
equally important is refining your plan after each incident, since the team and the
tools might have changed over time.
Another critical component is documenting the incident as it is occurring, not just
after the fact. This assures that the "good, the bad and the ugly" of the handling
process will be captured, studied and lessons will be drawn from it. The results of
such evaluations should be communicated to all the involved parties, including IT
resource owners and system administrators.
Ideally, the organization should build an incident-related knowledge base, so that
procedures are consistent and can be repeated in practice. The latter is very
important for regulatory compliance as well and will help satisfy some of the
Sarbanes-Oxley requirements for auditing the controls to information.
While the above cases are simplistic in nature, they readily show the need for any
security management system to have not only an incident response plan but also
an integrated incident handling system to ensure complete and effective
response planning deployment. Having a highly efficient plan helps organizations
save money by limiting the impact on core business from security incidents and
increasing the efficiency of existing security infrastructure investments. Overall,
the SANS process allows one to give structure to the otherwise chaotic incident
response workflow. It defines the steps that will then be followed under incident-
induced stress with high precision.
In fact, many of the above steps may be built from the pre-defined procedures.
Following the steps will then be as easy as selecting and sometimes customizing
the procedures for each case at hand. Incident handling workflow will become
more streamlined, and the crucial steps will not be missed and will be documented
properly. Using pre-defined procedures also helps train the incident response
staff on proper actions for each process step. The automated system may be
built to keep track of the response workflow, to suggest proper procedures for
various steps and to securely handle incident evidence. Additionally, such a
system will facilitate collaboration between various response team members,
who can share the workload for increased operational efficiency.
What is even more important, monitoring incident resolution activities allows the
organization to implement effective security metrics. It is one thing to count
the number of alerts or events flowing from various sensors, but to take security
assessment to the next level one needs to measure the performance of the
whole security process, involving both people (such as security team members
working on the incident cases) and technologies.
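As one possible illustration of such process-level metrics, the Python sketch below computes the mean time to containment and the mean time to closure from case timestamps. The `CaseRecord` fields and the metric names are assumptions for this example; which timestamps an organization records, and which metrics it tracks, will vary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class CaseRecord:
    # Hypothetical timestamps taken from closed incident cases.
    opened: datetime
    contained: datetime
    closed: datetime


def mean_hours(deltas) -> float:
    """Average a collection of timedeltas and express it in hours."""
    return mean([d.total_seconds() for d in deltas]) / 3600.0


def response_metrics(cases) -> dict:
    """Summarize how quickly incidents were contained and closed.

    Raises statistics.StatisticsError if no cases are supplied.
    """
    return {
        "mean_hours_to_containment": mean_hours(
            [c.contained - c.opened for c in cases]
        ),
        "mean_hours_to_closure": mean_hours(
            [c.closed - c.opened for c in cases]
        ),
    }
```

Tracked over time, figures like these describe the performance of the response process itself rather than the volume of alerts feeding it.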
ABOUT THE AUTHOR:
This is an updated author bio, added to the paper at the time of reposting in
Dr. Anton Chuvakin (http://www.chuvakin.org) is a recognized security expert in
the field of log management and PCI DSS compliance. He is an author of books
"Security Warrior" and "PCI Compliance" and a contributor to "Know Your Enemy
II", "Information Security Management Handbook" and others. Anton has
published dozens of papers on log management, correlation, data analysis, PCI
DSS, security management (see list www.info-secure.org). His blog
http://www.securitywarrior.org is one of the most popular in the industry.
In addition, Anton teaches classes and presents at many security conferences
across the world; he recently addressed audiences in the United States, UK,
Singapore, Spain, Russia and other countries. He works on emerging security
standards and serves on the advisory boards of several security start-ups.
Currently, Anton is developing his security consulting practice, focusing on
logging and PCI DSS compliance for security vendors and Fortune 500
organizations. Dr. Anton Chuvakin was formerly a Director of PCI Compliance
Solutions at Qualys. Previously, Anton worked at LogLogic as a Chief Logging
Evangelist, tasked with educating the world about the importance of logging for
security, compliance and operations. Before LogLogic, Anton was employed by a
security vendor in a strategic product management role. Anton earned his Ph.D.
degree from Stony Brook University.
Security event is a single observable occurrence as reported by a security device
or application or noticed by the appropriate personnel. Thus, both an IDS alert and
a security-related helpdesk call will qualify as security events.
Security incident is an occurrence of one or several security events that have a
potential to cause undesired functioning of IT resources or other related
problems. Thus, that limits our discussion to information security incidents, which
cover computer and network security, intellectual property theft and many other
Incident response (or IR) is a process of identification, containment, eradication
and recovery from computer incidents performed by a responsible security team.
It is worthwhile to note that the security team might consist of just one person,
who might only be a part-time incident responder. However, whoever takes part
in dealing with the incident consequences implicitly becomes part of the incident
response team, even if no such team formally exists within the organization.
Incident case is a collection of evidence and associated workflow related to a
security incident. Thus, the case is a history of what happened and what was done,
with evidence supporting both. It might include various documents
such as reports, security event data, results of audio interviews, image files and
Incident report is a document prepared as a result of an incident case
investigation. An incident report might be cryptographically signed or have other
assurances of its integrity. Most incident investigations will result in a report
submitted to appropriate authorities (either internal or outside the company),
which might contain some or even all data associated with the case.
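One simple way to give a report an integrity assurance is a keyed hash. The Python sketch below uses HMAC-SHA-256 from the standard library; note that this is a shared-key integrity tag, not a public-key digital signature, and it is offered only as an example of "other assurances of its integrity". The function names and the key handling are hypothetical.

```python
import hashlib
import hmac


def tag_report(report: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA-256 tag over the report contents."""
    return hmac.new(key, report, hashlib.sha256).hexdigest()


def verify_report(report: bytes, key: bytes, tag: str) -> bool:
    """Check that the report still matches a previously issued tag."""
    expected = tag_report(report, key)
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, tag)
```

Anyone holding the key can both create and verify tags, so where the report must be verifiable by outside parties a true digital signature would be the more appropriate choice.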
It is worthwhile to note that the term evidence, as used throughout the chapter,
indicates any data discovered in the process of incident response.