Looking back at 2015, it’s hard to dispute that the security industry has been
flooded with Endpoint Detection and Response (EDR) products. Walk the
sponsor floor at any conference, or sample the whitepapers and marketing
pitches from any vendor website, and you’ll see the same claims repeated ad
nauseam: “Prevent, Detect, and Respond” at “enterprise scale” in “real time”.
Throw in obligatory references to “anomaly detection”, “threat intelligence”,
and “APTs” for good measure.
It’s no wonder that so many organizations struggle to down-select and
evaluate vendors. Requests for Proposals yield vague or exaggerated
responses. Demos and small proof-of-concept labs with staged intrusion
scenarios can make any product look effective. Like any enterprise solution,
EDR tools reveal their true strengths and weaknesses only over time, once fully
deployed in real, complex, messy networks.
My role at Tanium might preclude me from claiming a truly vendor-neutral
point of view; however, having spent over a decade as a consultant
conducting security assessments, incident response investigations, and
remediation efforts, I continuously try to remain mindful of the criteria that
would have mattered most to me as a practitioner.
In this blog post, I’d like to focus on three foundational attributes that impact
any EDR solution’s effectiveness: the scope of data it provides, its performance
and scalability, and its flexibility.
What scope of data does the solution provide?
The scope and timeliness of the data that an endpoint product can search,
analyze, or collect represent the absolute core of its capabilities. Nearly every
EDR product now claims to provide “enterprise-wide search”, but the critical
question is, “Of what information?” Many solutions make significant tradeoffs
in the scope of available data in order to bolster scalability or performance.
Imagine if Google only allowed users to search the content of sites that it
indexed in the past 30 days; or, conversely, if it provided an unlimited search
timeframe, but only across the title tags of popular web pages.
Endpoint data can be roughly divided into two domains: historical and
current-state. Why are both important? Products that continuously record
endpoint telemetry, such as file I/O, network connections, process execution,
log-on events, or registry changes, have become increasingly popular for
incident detection and response. As I previously blogged during the launch
of Tanium Trace, this capability can accelerate and simplify the effort required
to triage a lead, generate alerts, or investigate a system. It preserves and
enriches artifacts that might otherwise be lost to gaps in forensic evidence,
and reduces the cadence of data retrieval needed to identify and retain
short-lived artifacts.

Yet exclusively relying on a sliding window of historical data incurs significant
limitations, particularly when hunting at scale. Such solutions restrict both
the timeframe of what’s retained and the breadth of data available for alerting
and analysis. If it’s not recorded, you can’t find it.
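To make that limitation concrete, here is a rough, purely illustrative sketch in Python of a sliding-window telemetry store. The retention period, event format, and class are all invented for this example; real EDR recorders are far more sophisticated, but the core constraint is the same: anything that ages out of the window can never be found later.

```python
from collections import deque
from datetime import datetime, timedelta

# Illustrative only: a toy sliding-window telemetry recorder with an
# invented 30-day retention period and event format.
RETENTION = timedelta(days=30)

class RollingTelemetry:
    def __init__(self):
        self.events = deque()  # (timestamp, event) tuples, oldest first

    def record(self, timestamp, event):
        self.events.append((timestamp, event))
        self._expire(now=timestamp)

    def _expire(self, now):
        # Anything older than the retention window is silently discarded.
        while self.events and now - self.events[0][0] > RETENTION:
            self.events.popleft()

    def search(self, predicate):
        return [event for _, event in self.events if predicate(event)]

store = RollingTelemetry()
store.record(datetime(2015, 10, 1), {"type": "process", "path": r"C:\temp\evil.exe"})
store.record(datetime(2015, 12, 1), {"type": "process", "path": r"C:\Windows\notepad.exe"})

# The October execution aged out of the window before anyone went hunting
# for it, so the search comes back empty.
print(store.search(lambda e: "evil.exe" in e["path"]))  # -> []
```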
To complement this narrower scope of information, effective incident
detection and response also requires on-demand access to current-state data
from all systems. That means having the ability to search for or collect volatile
artifacts that reflect what is happening right now. There are countless examples I’ve
encountered in investigations: Where is a compromised local administrator
account currently logged in? What systems currently have a malicious DLL
loaded in memory?
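To make those two questions concrete, here’s a rough single-host sketch in Python using the psutil library. The account name and DLL hash are placeholders; the point of an EDR platform is to answer the same questions across every endpoint, on demand, rather than one machine at a time.

```python
import hashlib
import psutil

# Placeholder indicators; in a real investigation these come from your triage.
MALICIOUS_DLL_SHA256 = "0000placeholder0000"   # hypothetical known-bad hash
SUSPECT_ACCOUNT = "localadmin"                 # hypothetical compromised account

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# 1) Is the compromised local administrator account logged in right now?
for session in psutil.users():
    if session.name.lower() == SUSPECT_ACCOUNT:
        print(f"{SUSPECT_ACCOUNT} has an active session (started {session.started})")

# 2) Does any running process currently have the malicious DLL mapped in memory?
for proc in psutil.process_iter(["pid", "name"]):
    try:
        for mapped in proc.memory_maps():
            if mapped.path.lower().endswith(".dll") and sha256_of(mapped.path) == MALICIOUS_DLL_SHA256:
                print(f"hit: pid={proc.info['pid']} name={proc.info['name']} dll={mapped.path}")
    except (psutil.AccessDenied, psutil.NoSuchProcess, OSError):
        continue  # processes we can't inspect, or mapped files we can't read
```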
Current-state data also encompasses latent artifacts that have not recently
changed, or were out of scope for historical preservation, but may be crucial
to scoping an incident. Consider the need to search across the environment
for any type of file “at rest” by name or hash; a registry value that hasn’t been
touched in a year; or a more esoteric forensic artifact like data in the WMI
repository. What if you need to install the solution after systems have already
been compromised? Working with a constrained scope of data inevitably
leads to blind spots and investigative dead-ends.
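As a simple illustration of the “files at rest” case, here is a minimal single-host Python sketch that sweeps a directory tree for matches by file name or SHA-256 hash. The search root and indicator values are placeholders; an effective solution has to run an equivalent sweep across the entire environment, regardless of when those files last changed or whether the agent was installed when they arrived.

```python
import hashlib
from pathlib import Path

# Placeholder indicators and search root for illustration only.
IOC_NAMES = {"rar.exe", "pwdump7.exe"}
IOC_HASHES = {"0000placeholder0000"}
SEARCH_ROOT = Path("C:/")

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for path in SEARCH_ROOT.rglob("*"):
    try:
        if not path.is_file():
            continue
        if path.name.lower() in IOC_NAMES or sha256_of(path) in IOC_HASHES:
            print(f"match: {path}")
    except (PermissionError, OSError):
        continue  # locked or unreadable files
```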
What is the solution’s performance and scalability?
Nearly every EDR vendor promises some variant of “real-time” speed that
scales to “tens of thousands” of endpoints. In practice, it’s unfortunately easy
for clever hand-waving to disguise a product’s true performance and
scalability limitations, especially if an evaluation process is limited to small test
labs. How can an organization ensure an EDR product performs well enough to
meet its use cases? The key is to take a more holistic approach
to assessing speed and scale.
First, consider the modes of interaction provided by the solution. Passive
workflows include ad-hoc searching (“Where is this hash?”), detection and
alerting (“Has an IOC hit or rule triggered?”), and data collection for anomaly
analysis (“Obtain all autoruns, stack by frequency of occurrence”). Active
workflows entail changing systems, be it enforcing a quarantine, killing a
process, removing malware, or fixing a configuration vulnerability.
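The “stack by frequency of occurrence” workflow deserves a concrete example. The sketch below (Python, with a made-up input format) shows the idea: autorun entries that appear on nearly every endpoint are usually benign, while the rare one-off entries are the ones that deserve an analyst’s attention.

```python
from collections import Counter

# Hypothetical records collected fleet-wide:
# (hostname, autorun_location, binary_path, sha256)
autoruns = [
    ("HOST-001", "HKLM\\...\\Run", "C:\\Windows\\system32\\igfxtray.exe", "aaa111"),
    ("HOST-002", "HKLM\\...\\Run", "C:\\Windows\\system32\\igfxtray.exe", "aaa111"),
    ("HOST-003", "HKLM\\...\\Run", "C:\\Windows\\system32\\igfxtray.exe", "aaa111"),
    ("HOST-002", "HKLM\\...\\Run", "C:\\Users\\bob\\AppData\\Local\\Temp\\svch0st.exe", "bbb222"),
]

# Stack by (path, hash) and count occurrences across the fleet.
stacks = Counter()
for host, location, path, sha256 in autoruns:
    stacks[(path, sha256)] += 1

# Review the least-frequent entries first: in a 50,000-endpoint environment,
# an autorun present on exactly one host stands out immediately.
for (path, sha256), count in sorted(stacks.items(), key=lambda kv: kv[1]):
    print(f"{count:6d}  {sha256}  {path}")
```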
Next, overlay the scope of data made available to each of these modes of
interaction. Does it include historical activity? Current activity? Latent files or
other artifacts at-rest? Finally, assess the performance and scalability of the
solution along each of these sets of criteria. Some solutions may provide
enterprise-wide access to a set of historical data, but cannot easily work with
current or latent data at scale (and vice versa). Do queries or
actions take seconds for some tasks, and hours for others?
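One practical way to keep vendors honest during an evaluation is to time representative tasks from each cell of that matrix. Below is a bare-bones Python harness for doing so; run_query is a placeholder for whatever API or console automation the product under test actually exposes, and the task list should reflect your own use cases.

```python
import statistics
import time

def run_query(task_name):
    # Placeholder: issue the query through the product's own interface
    # and block until results are returned.
    time.sleep(0.1)

TASKS = [
    "search historical telemetry for a hash (30-day window)",
    "search files at rest by hash across all endpoints",
    "collect all autoruns for frequency stacking",
    "quarantine a single endpoint",
]

for task in TASKS:
    samples = []
    for _ in range(5):
        start = time.monotonic()
        run_query(task)
        samples.append(time.monotonic() - start)
    print(f"{task}: median {statistics.median(samples):.1f}s over {len(samples)} runs")
```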
Organizations should also evaluate the infrastructure footprint and cost
incurred to operate the solution at the desired level of performance and scale.
For on-premises solutions that scale horizontally, maintenance costs rise and
effectiveness declines over time. More servers mean more points of failure,
and the need to re-architect and balance resource utilization as environments
grow.

In contrast, cloud-based solutions need to govern the volume of data
transmitted over the internet. This leads to reliance on client-side filtering and
trigger mechanisms that can curtail the scope of endpoint data available on-
demand or retroactively. Depending on your organization’s use cases, those
concessions may be unacceptable.
How flexible is the platform?
We’ve already stressed that incident response requires fast, scalable access to a broad set of
endpoint data. Every EDR tool is capable of working with “core” forensic
evidence: process activity, file system metadata, network connections, account
activity, OS-specific artifacts like the Windows registry, and so on. But just as
attacker techniques rapidly evolve, so too do the sources of evidence
introduced by new operating system updates, applications, or researcher
discoveries. An EDR solution’s flexibility directly impacts how quickly it can
incorporate these new findings.
When comparing products, many organizations simply ask for a list of
features and capabilities. I’d suggest going a step further to understand how
the product has been updated in the past, and how it’s poised to continue
maturing. That can include assessing the following:
- Ask to review the product’s change log for the past year to assess the pace of development. What types of new features have been added, and how quickly?
- Consider how the software is designed, and whether that lends itself to readily integrating new sources of data or interacting with endpoints in new ways. How much control do customers have? What requires a vendor-supplied agent update?
- What is the state of the user community? Are other customers sharing capabilities that go beyond what’s “out-of-the-box”?
Finally, consider how thoroughly the product addresses the “Response”
portion of “EDR”. Tactical remediation features, like killing a process or
isolating a machine on the network, are commonplace among most tools in
this space (though they may differ in scale or ability to orchestrate such
actions). But just as important – and often neglected – is whether the solution
can actually protect systems, complement other preventative controls, and
reduce endpoint attack surface. Key capabilities in this area include enforcing
control over what is allowed to execute or communicate over the network,
assessing and hardening security configuration settings, and maintaining
patch levels for the OS and third-party applications. Simply put, if an EDR
solution only makes you better at quickly detecting and responding to attacks,
it’s not actually making your organization more resilient, and it’s not helping
you break out of the cycle of re-compromise.
Testing, purchasing, and integrating any enterprise software is never an easy
task. And each iteration of rip-and-replace for a product that failed to meet
expectations brings significant operational risks and expenses. When
considering an EDR solution, I hope that some of the points outlined in this
blog post can help your organization form the right set of evaluation criteria
to identify the product that best fits your needs.