9. Malware | Many faces
• unlike real thieves, malware can be duplicated
• not only duplicated, but also modified
• all this is done by machines
• too much work to judge each one manually
12. Finding similar files | File vector
• each executable file is represented by a feature vector
• the PE format is complex, so we keep exactly one version of the extractor code (C++)
• the vector comprises static and dynamic features, the exact content is proprietary
Database record
• One record = constant vector of over 100 attributes
• the “file fingerprint”
• Each attribute has a data type and semantic
Attribute             Data Type      Semantic
sha256                32 byte array  CHECKSUM
pe_sect_cnt           uint16_t       VALUE
pe_sect_rawoff_entry  uint32_t       OFFSET
• The complete contents of the vector are kept secret
• static and dynamic features of PE executables
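A minimal sketch of how such a fixed-size record could be laid out, using only the three attributes shown in the table. The real vector has over 100 attributes and its contents are not public, so the class names, the little-endian packing and the helper functions here are illustrative assumptions, not the actual extractor format.

```python
import struct
from dataclasses import dataclass
from enum import Enum


class Semantic(Enum):
    CHECKSUM = "CHECKSUM"
    VALUE = "VALUE"
    OFFSET = "OFFSET"


@dataclass(frozen=True)
class Attribute:
    name: str
    fmt: str            # struct format code standing in for the data type
    semantic: Semantic


# Only the attributes named on the slide; everything else is unknown.
SCHEMA = (
    Attribute("sha256", "32s", Semantic.CHECKSUM),              # 32 byte array
    Attribute("pe_sect_cnt", "H", Semantic.VALUE),              # uint16_t
    Attribute("pe_sect_rawoff_entry", "I", Semantic.OFFSET),    # uint32_t
)

# "<" means little-endian without padding (an assumption), so every
# record occupies exactly the same number of bytes.
RECORD_FORMAT = "<" + "".join(a.fmt for a in SCHEMA)
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def pack_record(values: dict) -> bytes:
    """Serialize one fingerprint in schema order."""
    return struct.pack(RECORD_FORMAT, *(values[a.name] for a in SCHEMA))


def unpack_record(blob: bytes) -> dict:
    """Inverse of pack_record."""
    fields = struct.unpack(RECORD_FORMAT, blob)
    return {a.name: v for a, v in zip(SCHEMA, fields)}
```

Because every record has the same size and the same attribute order, the data type and semantic of each position are known without storing them per record.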
13. Finding similar files | Distance
• sum of partial distances
• each distance operator assigned manually
• weights assigned manually to equalize contribution
Nearest neighbor query
• Compound distance function
• Data type and semantic determine partial dist. func.
Data Type      Semantic  Partial distance function
32 byte array  CHECKSUM  RETURN_ZERO
uint16_t       VALUE     EQUAL_RET32
uint32_t       OFFSET    LOG
• Each partial distance function = one kernel function
• Over 100 kernels for every NN query
• Intermediate results kept in the “Scratchpad”
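The compound distance can be sketched as a weighted sum of per-attribute partial distances. The slides give only the names RETURN_ZERO, EQUAL_RET32 and LOG; the bodies below are guesses from those names (a constant 0, a penalty of 32 on mismatch, and a difference of logarithms), and the weights are made up for illustration.

```python
import math


def return_zero(a, b):
    # RETURN_ZERO: the attribute never contributes to the distance.
    return 0.0


def equal_ret32(a, b):
    # Assumed meaning of EQUAL_RET32: 0 when equal, else a fixed penalty of 32.
    return 0.0 if a == b else 32.0


def log_distance(a, b):
    # Assumed meaning of LOG: compare magnitudes on a logarithmic scale.
    return abs(math.log2(a + 1) - math.log2(b + 1))


# (data type, semantic) selects the partial distance function.
PARTIAL = {
    ("bytes32", "CHECKSUM"): return_zero,
    ("uint16_t", "VALUE"): equal_ret32,
    ("uint32_t", "OFFSET"): log_distance,
}

# name, data type, semantic, weight (weights are hypothetical).
SCHEMA = [
    ("sha256", "bytes32", "CHECKSUM", 1.0),
    ("pe_sect_cnt", "uint16_t", "VALUE", 1.0),
    ("pe_sect_rawoff_entry", "uint32_t", "OFFSET", 4.0),
]


def compound_distance(x, y):
    """Weighted sum of partial distances over all attributes."""
    total = 0.0
    for name, dtype, semantic, weight in SCHEMA:
        total += weight * PARTIAL[(dtype, semantic)](x[name], y[name])
    return total
```

In this sketch the checksum is ignored entirely, so two files with different hashes but identical structural attributes end up at distance 0.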
14. Finding similar files | Data
• ~60 M data points
• sparse and well separated (in many cases)
15. Finding similar files | Implementation
• we started with GPUs
• their high memory throughput allows “naive” implementation and rapid prototyping
• column-oriented database
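A rough sketch of what a “naive” nearest-neighbor scan over column-oriented data looks like: one pass per attribute column, each adding its partial distance into a per-record scratchpad, followed by a minimum search. Plain Python loops stand in for the GPU kernels here; the two kernel functions, the weights and the sample data are hypothetical.

```python
import math


def equal_penalty(a, b):
    # Hypothetical kernel: fixed penalty when the values differ.
    return 0.0 if a == b else 32.0


def log_gap(a, b):
    # Hypothetical kernel: distance on a logarithmic scale.
    return abs(math.log2(a + 1) - math.log2(b + 1))


# Column-oriented storage: one list per attribute, index = record number.
COLUMNS = {
    "pe_sect_cnt": [3, 5, 8],
    "pe_sect_rawoff_entry": [512, 1024, 4096],
}

# One (kernel, weight) pair per column.
KERNELS = {
    "pe_sect_cnt": (equal_penalty, 1.0),
    "pe_sect_rawoff_entry": (log_gap, 1.0),
}


def nearest_neighbor(columns, kernels, query):
    """Brute-force scan: returns (record index, distance) of the closest record."""
    n = len(next(iter(columns.values())))
    scratchpad = [0.0] * n                      # running distance per record
    for name, (kernel, weight) in kernels.items():
        column, q = columns[name], query[name]
        for i in range(n):                      # one "kernel launch" per column
            scratchpad[i] += weight * kernel(column[i], q)
    best = min(range(n), key=scratchpad.__getitem__)
    return best, scratchpad[best]
```

Each column is read sequentially exactly once per query, which is the access pattern that benefits from high memory throughput.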
18. Classification | Optimizations
• scaling and HW problems with GPUs
• we invested in algorithmic optimizations: VP-tree, distance bounded search
• hand optimized distance function (assembly)
• CPU version is ~100x faster
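For readers unfamiliar with vantage-point trees, here is a textbook-style sketch with a distance bound, not the production code. It assumes the distance function obeys the triangle inequality; Euclidean distance on 2-D points is used only as a stand-in for the compound distance, and all names are illustrative.

```python
import math
import random


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point        # vantage point
        self.radius = radius      # median distance from the vantage point
        self.inside = inside      # subtree with dist(vp, p) <= radius
        self.outside = outside    # subtree with dist(vp, p) > radius


def build(points, dist, rng=None):
    """Recursively split the points around a randomly chosen vantage point."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0.0, None, None)
    dists = [dist(vp, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vp, radius, build(inside, dist, rng), build(outside, dist, rng))


def nearest(node, query, dist, bound=float("inf")):
    """Return (distance, point) of the nearest point closer than `bound`.

    If nothing lies within the bound, the result is (bound, None).
    """
    best = (bound, None)

    def visit(n):
        nonlocal best
        if n is None:
            return
        d = dist(query, n.point)
        if d < best[0]:
            best = (d, n.point)
        # Search the side that contains the query first.
        near, far = (n.inside, n.outside) if d <= n.radius else (n.outside, n.inside)
        visit(near)
        # By the triangle inequality the far side can only hold a better
        # point if the current search ball crosses the partition boundary.
        if abs(d - n.radius) < best[0]:
            visit(far)

    visit(node)
    return best


# Usage with random 2-D points.
_rng = random.Random(1)
POINTS = [(_rng.random(), _rng.random()) for _ in range(200)]
TREE = build(POINTS, math.dist)
```

The tighter the initial bound, the more subtrees are pruned, which pays off when the data is sparse and well separated and most queries either have a very close neighbor or none at all.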
19. Classification | Deployment
[Diagram: data flows between Avast users, FileRep, Scavenger and Medusa]
→ File SHA and user id →
← File prevalence ←
← File classification ←
→ File fingerprint →
← Generic detections ←
↑ File classifications and Evo-gen detections
→ Threats →
↓ Set updates
22. Rule generator
• detect more variants in the wild
• (our) rule is a conjunction of several conditions
• known as Win32:Evo-Gen
• completely different optimization problem than classification - still uses the GPU
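To make “a conjunction of several conditions” concrete, here is a small sketch of how such a rule could be evaluated against fingerprints. The slides do not say what a condition looks like, so the range test per attribute, the scoring function and the sample records are all assumptions for illustration, not the Evo-Gen format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    # Hypothetical condition: an attribute value must fall in [low, high].
    attribute: str
    low: int
    high: int

    def holds(self, record):
        return self.low <= record[self.attribute] <= self.high


def rule_matches(rule, record):
    """A rule fires only if every one of its conditions holds."""
    return all(c.holds(record) for c in rule)


def score(rule, malware, clean):
    """Count detections among malware and false positives among clean files."""
    detections = sum(rule_matches(rule, r) for r in malware)
    false_positives = sum(rule_matches(rule, r) for r in clean)
    return detections, false_positives


# Made-up sample data using the attribute names from the earlier table.
RULE = (
    Condition("pe_sect_cnt", 3, 4),
    Condition("pe_sect_rawoff_entry", 1024, 2048),
)
MALWARE = [
    {"pe_sect_cnt": 3, "pe_sect_rawoff_entry": 1024},
    {"pe_sect_cnt": 4, "pe_sect_rawoff_entry": 1536},
    {"pe_sect_cnt": 9, "pe_sect_rawoff_entry": 1024},
]
CLEAN = [
    {"pe_sect_cnt": 3, "pe_sect_rawoff_entry": 4096},
    {"pe_sect_cnt": 6, "pe_sect_rawoff_entry": 1024},
]
```

Under these assumptions, a rule generator would search for the set of conditions that maximizes detections while keeping false positives at zero, which is an optimization over candidate rules rather than a nearest-neighbor lookup.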