Automated Sample Processing

Schon Brenner, Engineering Software Development Team Lead
Dmitry Gryaznov, Sr. Research Architect
Joel Spurlock, Virus Research Lead

McAfee AVERT
USA

Abstract
Long gone are the days when the number of viruses, trojans and other malware was counted in dozens, hundreds, even in
thousands. A few years ago the number of known unique pieces of malware exceeded 100,000, and everybody stopped
keeping the exact count.

These days up to 10,000 pieces of malware are added to the McAfee malware collection, and their detection and removal to
McAfee products, in a single month. When it comes to the malware analysis workload in the same month, the 10,000 are
the tip of the iceberg of all the processed samples. There are numerous sources of samples for malware analysis: customer
submissions, malware patrols, honeypots, Web trawlers, and last but not least, malware collections from other antivirus
vendors and researchers.

All those sources amount to well over 100,000 samples awaiting processing in an average month. Of course, it would be
simply impossible to process them all manually.

This paper offers a “behind the scenes” view of the McAfee AVERT automation. The automation consists of several logical
pieces: source analysis (prioritization, geographical analysis, denial of service protection, etc), sample elimination, static
and behavioral analysis, and content generation (malware descriptions, malware definitions, system training, etc). As part
of sample processing and system training, researchers interact with the automation to increase its processing capabilities.

The Problem

About ten years ago some computer antivirus researchers started worrying about a possible malware “glut” problem: a sustained situation in which new viruses, trojans, other malware and variants appear faster than the antivirus industry can handle. At that time the total number of all known viruses, trojans, etc. was about 10,000 and it was growing at a rate of several thousand a year (see Figure 1). An average antivirus researcher can fully process 8-10 new malware samples per day. That includes analyzing each sample using tools like unpackers, disassemblers, debuggers, etc.; running each sample in a controlled environment on a dedicated physical or virtual computer; if the sample proves to be malicious, creating and testing detection and removal routines for the malware; and optionally creating and publishing a description for the malware. Not a small job to do in less than an hour.

Figure 1 Growth in Malware 1988-1998

From 1996 to 1997 the number of known viruses and trojans grew by about 6,000 in one year, or by about 23 a day. An antivirus company back then had 3-5 antivirus researchers, who often spent their time not only processing samples but also reverse engineering and supporting new file formats, unpackers and archivers, and often worked as antivirus engine developers as well. Thus, it is hardly surprising that some antivirus companies found themselves at the limits of their sample processing capacity. Improvements in antivirus technologies (e.g. heuristic and generic detection and removal) and growth of the antivirus companies helped to alleviate the problem.

Today we count the total number of all known malware in hundreds of thousands and routinely see thousands of new malware threats appearing in a single month. During the year 2005 over 55,000 new malware samples were added to the McAfee AVERT Labs malware collection (see Figure 2). That means over 210 samples per day, which, at the processing rate of 8-10 samples per antivirus researcher, requires 20-25 researchers doing nothing but full-time sample analysis and processing. In reality, these numbers are just the tip of the iceberg of samples. For each sample eventually classified as malware and added to the internal malware collection, there are many more samples that had to be processed. Some of them proved not to be malicious, some were corrupted, some were duplicate samples of the same malware, and so on.

Figure 2 Samples added to McAfee AVERT malware collection 1999 - 2005

Let’s look at the sources of samples coming to McAfee AVERT Labs. First, there are direct submissions of samples to AVERT from customers, sometimes from malware authors and malware itself (e.g. a mass-mailing virus). These days AVERT routinely receives 10,000-15,000 such submissions a week.

Then, there are malware collections from other antivirus companies. It was realized a long time ago that in order to succeed in the business of protecting customers from malware, antivirus companies must share samples of new malware with each other, despite being competitors on the market. These samples have been shared mostly on a monthly basis and are generally known as “monthly collections”. Ten years ago a typical monthly collection contained several hundred samples, was several megabytes in size when archived and could be distributed via E-mail even over a dial-up modem connection. Today a single monthly collection contains thousands or tens of thousands of samples, is hundreds of megabytes or even several gigabytes in size, and takes several hours to download over a broadband Internet connection. In an average month McAfee AVERT receives over 100,000 samples in monthly collections from about 20 other antivirus companies.

Since this means several thousand new malware samples a day, some of which may be fairly urgent, many antivirus companies, as well as other antimalware entities, started sharing new samples on a daily basis: the so-called “daily collections”. There are also other sources of frequent sample submissions that fall into this category, for example services like VirusTotal, organizations like CERT, etc. On an average day in September 2006 McAfee AVERT received 2,000-3,000 samples in daily collections.

Yet another source of potentially malicious samples is active monitoring of different networks like Usenet, Internet Relay Chat (IRC), peer-to-peer file sharing and so on, known as AVERT Virus Patrols. Virus Patrols deal with many thousands of samples on an average day.

Altogether, the sources listed above amount to several hundred thousand samples in an average month, or tens of thousands of samples on an average day. Processing all those samples manually would require hundreds if not thousands of malware researchers, which is, of course, infeasible in the real world. In reality, the overwhelming majority of these samples is processed by means of automation.

AVAR 2006 - Auckland

The Solution

The nature of research at a general level is quite sequential. A long list of discrete tasks can be executed in predictable ways, which makes researchers’ lives significantly repetitive. These repetitive tasks are the first candidates for automation. Since 1998 McAfee Avert has been employing automation to handle the most repetitive aspects of sample analysis. As time progressed and additional tasks were identified, automation was enhanced to provide not only sample analysis but also automated resolution and response to customers. With the automation systems of the present, McAfee Avert is able to close at most approximately 90% of all samples, leaving the remaining 10% to be handled manually by McAfee Avert researchers. While 10% doesn’t seem like an egregious number, as time has progressed and the growth of malware has accelerated in both volume and complexity, this remaining percentage has become intolerable for both existing and future research staff.

If the rate of malware growth remained constant, further refinements of automated malware analysis would be sufficient. In reality, however, with malware growth accelerating in volume and complexity, the automation that combats it must in turn grow with equal or greater potency. In order to meet this objective the automation of the present must provide data to assist researchers in identifying relationships among the data, as well as a means by which researchers can develop and extend automation to automate future discrete and repeatable tasks in real time; or, in other words, a system to teach, a system to learn from.

[Figure 3: cycle between Human Research and Automated Research. The researcher develops new analysis and remediation capabilities, which automation consumes; automation produces automated analysis and remediation results, which the researcher consumes.]
Figure 3 Flow diagram of human / automation system

Figure 3 describes the flow by which a researcher can both use the system to automate their work and educate the system to provide new and enhanced analysis and remediation capabilities. Furthermore, a researcher can enhance the system to provide data that would highlight, prove or disprove theories that would further their capabilities as a researcher.

Automating the research

While the automation system can be extended to teach and learn from, at any given point in time its capabilities are static and are centered on a process of sample analysis and remediation. While there are many discrete tasks, the automation can be broken down into several separate high level stages.

■ Triage

■ Examination

■ Observation

■ Diagnosis

■ Resolution

Before automation progresses from one stage to the next it attempts to diagnose the inquiry to a resolution. If the diagnosis is definitive then one or many solutions can be provided and a response is given to the client. If at the end of all stages a definitive diagnosis cannot be found, escalation to researchers is required to assist automation in finding a diagnosis.
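The stage-by-stage flow (run a stage, attempt a diagnosis, escalate if no stage yields one) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not AVERT's implementation: the `Inquiry` class, the stage functions, the `KNOWN_CLEAN` / `KNOWN_DIRTY` sets and the `EVIL` marker are all hypothetical placeholders. The triage stub uses an MD5 lookup only because the paper names hashing (md5, etc) as the triage mechanism.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Inquiry:
    """One client inquiry; `findings` accumulates what each stage learns."""
    sample: bytes
    findings: dict = field(default_factory=dict)


# Hypothetical known-file sets, keyed by MD5 hex digest.
KNOWN_CLEAN = {hashlib.md5(b"harmless").hexdigest()}
KNOWN_DIRTY = {hashlib.md5(b"malicious").hexdigest()}


def triage(inquiry):
    """Compare the sample hash against the known clean and known dirty sets."""
    digest = hashlib.md5(inquiry.sample).hexdigest()
    if digest in KNOWN_CLEAN:
        inquiry.findings["classification"] = "innocent"
    elif digest in KNOWN_DIRTY:
        inquiry.findings["classification"] = "malware"


def examination(inquiry):
    """Placeholder static check: flag a made-up marker string."""
    if b"EVIL" in inquiry.sample:
        inquiry.findings["classification"] = "trojan"


def observation(inquiry):
    """Placeholder behavioral stage: this stub gathers no evidence."""


def diagnose(inquiry):
    """Return a definitive classification, or None if there is not enough data."""
    return inquiry.findings.get("classification")


STAGES = [("triage", triage), ("examination", examination), ("observation", observation)]


def process(inquiry, stages=STAGES):
    """Run the stages in order and attempt a diagnosis after each one."""
    for name, stage in stages:
        stage(inquiry)
        verdict = diagnose(inquiry)
        if verdict is not None:
            return {"diagnosis": verdict, "decided_after": name, "escalated": False}
    # No stage produced a definitive diagnosis: hand over to a researcher.
    return {"diagnosis": "unknown", "decided_after": None, "escalated": True}
```

For example, `process(Inquiry(b"harmless"))` stops after triage, while an inquiry whose sample matches nothing falls through every stage and comes back with `escalated` set to `True`. Keeping the diagnosis attempt outside the stages reflects the description that a diagnosis is tried before moving on to the next stage.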


[Figure 4: flowchart. Inquiry (1:N symptoms, 1:N samples) → Triage (priority, known clean, known dirty, last seen) → Examination (looks like, competitors, suspicious attributes) → Observation (replicate, emulate, disassembly) → Diagnosis (innocent, PUP, driver, malware, unknown) → Resolution (VIL description, 1:N solutions, add to known clean / dirty) → Response. An “Enough info?” check follows Triage and Examination; a diagnosis that is not definitive leads to Escalation and researcher notification.]
Figure 4 Inquiry / Response flow

Inquiry / Response
Clients interface with automation through an inquiry and response mechanism. For every inquiry, the client can provide a description of the problems or symptoms that they are experiencing and submit one or many samples (including but not limited to files). Once a diagnosis has been made, either by automation or by a human, a response is generated and delivered to the client.

Triage
The intent of this stage is to identify what we know about the sample in order to prioritize the inquiry. In some cases being able to identify the sample can yield an immediate diagnosis and resolution of the inquiry. If a diagnosis can’t be made then the inquiry is queued for examination.

■ Known status is determined first, by comparing the samples to a set of known clean files through a hashing mechanism (md5, etc). The set of known clean files can be updated through partner programs, by adding binaries from trusted sources, or through manual analysis. Secondly, the sample can be compared to the set of known dirty files, which is maintained through similar mechanisms as the clean set.

■ Prioritization is determined through a combination of identifying a sample’s origin as well as its known status. This prioritization is then used throughout the system.

■ Severity can also be determined through a combination of sample origin and priority. This is also used throughout the system.

Examination
If triage is not capable of diagnosis and resolution then examination is required. From an automated perspective the examination process considers the following:

■ Sample attributes discovery

■ Sample dissection (unpacking)

■ Competitor detections

■ Looks like (string / byte comparison to known databases)

When the sample is being inspected, the goal is to determine as much information as possible through static sample evaluation. During this process new information can be discovered through unpacking of samples, extraction of scripts from HTML, or decryption of code or scripts. Some of this information can actually be considered samples as well.

Sample attributes are key-value pairs, such as current scan detections, file type, file version info, packer information, high level language, captured resources or various checksums on portions of the sample.

Competitor detections are interesting in many ways. Although they are not used as a sole means of detection, they provide supporting information for classification algorithms. If a sample is automatically classified as a Trojan, and competitors detect, escalation to researchers can occur in order to validate automated conclusions. Likewise, competitor information can be used to reinforce an automated conclusion.

“Looks like” is a complex evaluation process which compares packer type, resources, strings and specific byte sequences to known databases of samples (both clean and dirty). The history databases used in the “Looks like” process are sorted by classification, file type, and other attributes. They can contain information about all of the samples previously evaluated and as such are very large. Filtering is done using attributes such as type, size, etc, as well as competitor information.

Observation
If a diagnosis has still not been achieved through examination, then the behavior of the sample should be evaluated. Sample behavior is evaluated to determine ‘what does the sample do’. This can be done by evaluating the functions or byte sequences of the sample, emulating the execution of the sample, or replicating the sample in a physical or virtual environment.

From the behavior, specific features of the sample can be determined. API analysis or execution traces can be used to automatically determine that a sample connects to an IRC server, sends an email, reads stored user password information, etc. Disassembly analysis can identify code sequences that indicate behaviors. Exploit code, code for writing to explorer memory, and code for downloading files are examples of this.

Any sample replication or emulation system should make some effort to emulate resources to encourage the sample to exhibit more behavior. These environments provide different network services (IRC, SMTP, etc), AV binaries for process termination, and installed software. Using common or more vulnerable platforms / software is important. Windows XP, Windows 2000 server, and Office 2000 with no patches or service releases are generally more vulnerable.

As a function of evaluating the sample behavior, new samples can be discovered (dropped or downloaded files, urls, etc).

Diagnosis
Diagnosis is primarily achieved through classification of the sample or samples analyzed by the system. Classifying a sample is defined as automation making an assertion about the sample. The assertion can be ‘clean’, ‘trojan’, ‘virus’, etc. After each step, classification can be attempted. If enough data has been captured to classify a sample, or the sample has been seen before and a previous classification can be used, then automation can proceed to generating a resolution. If not enough data is available to make an assertion, then the sample will continue being evaluated. When all work on the sample and any discovered samples is complete, and enough information to classify is still not available, then an ‘unknown’ classification is selected as a default.

Innocent Files
Automatically classifying files as clean can be dangerous. If a virus or trojan is incorrectly identified as a clean sample by an automated system, then the solution provided will corrupt the integrity of the virus definitions as well as erode customer confidence. Classification of samples as ‘innocent’ must therefore be reviewed by a human. There are three ways to classify innocent files. First, the sample can be identified as a junk file. These are generally text or log files which meet some stringent criteria, or heavily corrupted PE files. Secondly, the sample can be identified as an innocent file. This can be done through a combination of string / byte analysis, comparison to history databases, and behavior analysis. Usually the file will be compiled in specific ways, have appropriate version information, be signed in some way, not be packed, etc. Finally, a false positive can be identified through a combination of competitor and signature scans. For example, if a scan detection occurs and no competitors detect, it is possible that the detection of the sample is a false positive.

Trojan / Virus
Trojans and viruses are the easiest samples to classify in an automated sense. Trojans and viruses tend towards the more egregious activities, so it is easier to make some determination. In the simplest terms, the sample does something that is deemed ‘malicious’, therefore it is a trojan, virus, etc. This is most evident through behavioral evaluation, but can also be discovered during file inspection. Examples may include file infection, downloading or dropping of known dirty files, installation of rootkits, and containing exploit code sequences.

Automated algorithms can identify ‘this is a virus mass mailer’ fairly easily. It is more difficult to classify a Trojan or virus by name and family. A historical comparison or predictive algorithms must be utilized. The algorithm must be able to determine that the sample is a ‘bagle’ mass mailer and not a ‘netsky’ mass mailer. This information is gathered during the “Looks like” sample inspection, and can be reinforced with competitor information.

Potentially Unwanted Programs (PUP)
Classifying a sample as a potentially unwanted program can be done based on several factors. There are some distinctions that can be made about PUPs which separate them from more benign Trojans. These ‘positive’ attributes can include: installers, a license agreement, uninstallers, a website, signed binaries, and some user interface (toolbar component, dialog, etc).

PUPs will not have overtly malicious activities, but questionable activities. This separates them from innocent files or other legitimate applications. These can include displaying advertisements out of the context of the main application window, sending personal information, redirecting search criteria, modifying the start page, etc.

It is the presence of the positive attributes in combination with the questionable activities that allows for automatic detection of PUPs. As with viruses and Trojans, historical comparison (or “Looks like”) can be used to specify a name, and competitor information can be used to reinforce the classification.

Guilt by Association
During the sample inspection phase and the replication phase, new samples can be discovered which also need analysis. Each new sample gets processed recursively through inspection, evaluation, and classification.
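The recursive processing of discovered samples can be sketched as follows. This is an illustration only: the `Sample` class, the relation names `downloads` and `drops`, the label `PUP dropper` and the `classify` function are hypothetical, and the two naming rules simply mirror the downloader and dropper examples given in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    name: str
    verdict: str = "unknown"  # classification from analyzing this sample alone
    # (relation, Sample) pairs discovered during inspection or replication
    discovered: list = field(default_factory=list)


def classify(sample, seen=None):
    """Classify a sample, recursing into every sample it led to."""
    seen = set() if seen is None else seen
    if sample.name in seen:
        # Already being processed; this guard stops cycles from looping forever.
        return sample.verdict
    seen.add(sample.name)
    for relation, child in sample.discovered:
        child_verdict = classify(child, seen)
        if sample.verdict != "unknown":
            continue  # the sample's own analysis already decided
        if relation == "downloads" and child_verdict == "trojan":
            sample.verdict = "Trojan Downloader"
        elif relation == "drops" and child_verdict == "PUP":
            sample.verdict = "PUP dropper"
    return sample.verdict


payload = Sample("payload.exe", verdict="trojan")
loader = Sample("loader.exe", discovered=[("downloads", payload)])
print(classify(loader))
```

Here `loader.exe` has no verdict of its own and inherits one from what it downloads. A real system would presumably weigh many more relations and attributes; the `seen` set is included because two samples may reference each other.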


Through evaluation of the new samples, yet more information can be discovered (and so on). Grouping these together and relating them allows for classification of more samples than just evaluating the single sample. Some simple examples:

■ File downloads another file, which is classified as a trojan. The first file is a ‘Trojan Downloader’.

■ File drops another file, which is classified as a PUP. The first file can then be classified as a dropper or installer of the PUP.

Unknown
The Unknown classification is a catch-all. In effect, it means that the automated system is unable to make a determination and will require a human to participate in the classification process.

Resolution
The solution which is generated depends on the source of the sample, its prevalence, its classification, and what work has already been completed. Solutions cannot be generated unless a classification has been determined. Solutions can include public or private descriptions of the sample behavior, generating detection and repair, adding the sample information to a known clean or known dirty database, or responding to the submitter. Most of the solutions are self-explanatory. The automatic generation of signatures to detect and repair has caveats which call for further explanation.

Generating automated detection for a sample carries a level of risk. Detecting a sample incorrectly can lead to false positives (e.g. hitting on packer code), or inefficient detection which will bloat malware signatures. Additionally, if a sample is parasitically infected by some virus, automatic generation is not safe, and the task of generating detection and repair must be performed by a human.

The vast majority of samples which require signature generation are some new packed or encrypted version of a sample which has been evaluated before. Generating detection for packed samples can be done with minimal risk using a data driven generation technique. A safe method of detection is selected based on the attributes detected in the sample inspection. For example, a UPX packed file will have detection generated in one specific way versus a Morphine encrypted file.

Generating signatures for unpacked files is also possible but carries with it some level of risk. These detections are inherently generic, and usually based on string sequences or resource signatures in the sample. The correct string or resource can be extracted from the sample and compared to a database of known information in order to minimize the chance of creating a false positive.

Generating repair is done by evaluating the sample behavior and providing instructions in the signature to reverse the effects on the system. Care must be taken to only remove effects that the sample caused where the previous value is known or a safe default can be selected. Removal can include tasks ranging from removing or editing a registry key to modifying the network stack to remove a layered service provider.

Methods of generation have different risk levels which can be evaluated over time. Some methods, like strings, will almost always require a human to validate before committing the signature to the definitions. Each method of generating a signature can become trusted over time, removing the human from the validation loop. Methods to generate signatures based on packed data inherently carry less risk than generic signature generation methods.

Automation Teach Thyself
As a function of sample inspection, behavioral evaluation, classifying samples, creating solutions, and interacting with humans, large amounts of data are constantly being generated. This can include detection names for samples, hashes for clean samples, strings and resources to add to historical databases, or API traces. Each one of these pieces of data is placed in various data stores, which can then be called upon when processing the next sample.

Additionally, algorithms can make use of this constantly updated data in order to come to a diagnosis. If a new sample is processed, it may provide some piece of correlating information that will cause other samples to then be classified. Researchers can also update or create new algorithms to utilize this data, as well as create new data gathering methods and processes. In this way, automation can teach itself by drawing conclusions and generating data to be correlated for the next sample.

Cost Benefit Analysis

While automation provides a means by which to distill an inquiry to a resolution, it still suffers from the same rudimentary bottlenecks that researchers face with sample analysis. That is, the costs associated with multi-processing (handling simultaneous related or non-related inquiries at a time) and intensive recursive, aggregate and/or search algorithms in order to form some concrete diagnosis, albeit at a much higher bound.
or resource signatures in the sample. The correct string or


[Figure 5: chart with two series per stage (Received, Triage, Examination, Observation). Pending Diagnosis is shown as a percentage of received samples (labeled 100.00%, 26.56%, 11.78%, 2.40%); Diagnosis Cost (min) is shown against a second, logarithmic axis.]
Figure 5 Estimative analysis of sample volume vs. completion time.

Figure 5 depicts an estimated analysis of sample volume (Pending Diagnosis) vs. completion time (Diagnosis Cost) per stage as samples progress through the malware analysis and remediation process.

On average, McAfee Avert receives 2067 samples from the field a day. As samples progress through the triage stage, they are filtered, prioritized and diagnosed in a matter of milliseconds based on the hash of the sample and other previous diagnoses, reducing the volume of samples requiring further analysis by 73%.

The remaining 27% then progress through the examination stage, where samples are scanned with a set of AV scanners and tools in a matter of seconds to minutes. Only about 44% of the samples entering this stage remain undiagnosed afterwards, leaving 12% of all received samples to be handled by the Observation stage.

During the observation stage samples are monitored for behavior and the results are analyzed, leaving undiagnosed only about 20% of the samples that reach this stage; however, this stage is the most expensive and takes up to 15 minutes to complete.

Overall, the process eliminates 97% of all samples received by McAfee Avert, leaving only the remainder for humans to diagnose manually.
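The per-stage arithmetic can be checked with a short calculation. The sketch below assumes that the labels 100.00%, 26.56%, 11.78% and 2.40% in Figure 5 belong to the Pending Diagnosis series and that the 2067 samples per day apply to the whole funnel; the function and variable names are made up for illustration.

```python
# Share of received samples still pending after each stage (assumed Figure 5 labels).
PENDING_PCT = [
    ("Received", 100.00),
    ("Triage", 26.56),
    ("Examination", 11.78),
    ("Observation", 2.40),
]
SAMPLES_PER_DAY = 2067  # average daily field submissions quoted in the text


def funnel(pending_pct, per_day):
    """For each stage, derive the pending count and the share it passes on."""
    rows = []
    for (_, before), (stage, after) in zip(pending_pct, pending_pct[1:]):
        rows.append({
            "stage": stage,
            "pending_samples": round(per_day * after / 100),
            "passed_on": after / before,        # share of this stage's input left over
            "diagnosed": 1 - after / before,    # share of this stage's input resolved
        })
    return rows


for row in funnel(PENDING_PCT, SAMPLES_PER_DAY):
    print(f"{row['stage']:<12} pending={row['pending_samples']:>4}  "
          f"passed on={row['passed_on']:.0%}  diagnosed={row['diagnosed']:.0%}")
```

Under these assumptions roughly 549, 243 and 50 samples a day remain after triage, examination and observation respectively. The 44% and 20% figures quoted for the examination and observation stages match the share each stage passes on (11.78 / 26.56 and 2.40 / 11.78), and the final 2.40% is consistent with the overall 97% elimination rate.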

With the current inflow of samples, the number of researchers required to process this load manually would be in the hundreds; with the benefit of the malware analysis and remediation process, the number of researchers required by McAfee Avert is approximately 50-75, a much more tolerable number.
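The "hundreds" estimate follows from two figures given earlier in the paper: 2067 field samples a day and a manual rate of 8-10 fully processed samples per researcher per day. A minimal check, with illustrative names and counting field submissions only:

```python
import math

SAMPLES_PER_DAY = 2067          # average daily field submissions
RATE_PER_RESEARCHER = (8, 10)   # samples one researcher can fully process per day


def researchers_needed(samples_per_day, rate_per_researcher):
    """Smallest whole number of researchers that covers the daily load."""
    return math.ceil(samples_per_day / rate_per_researcher)


low = researchers_needed(SAMPLES_PER_DAY, max(RATE_PER_RESEARCHER))
high = researchers_needed(SAMPLES_PER_DAY, min(RATE_PER_RESEARCHER))
print(f"{low}-{high} researchers for fully manual processing")
```

This gives 207-259 researchers for field submissions alone, before monthly collections, daily collections and Virus Patrols are taken into account. The 50-75 figure for the automated process is taken from the text and is not derived here.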

