Bilar (2007) callgraph properties of executables


Callgraph properties of executables and generative mechanisms

Daniel Bilar *
Wellesley College, Department of Computer Science, Wellesley, MA 02481, USA
E-mail: dbilar@wellesley.edu

This paper examines the callgraphs of 120 malicious and 280 non-malicious executables. Pareto models were fitted to the in-degree, out-degree and basic block count distributions, and a statistically significant difference is shown for the derived power law exponent. Generative mechanisms are discussed, and a two-step optimization process based on resource constraints and robustness tradeoffs (PLR HOT mechanism) is sketched to account for the structure of the executable.

Keywords: Executables, Callgraph, HOT process, Graph-structural fingerprint, Malware, PLR model

* Corresponding author: Daniel Bilar, Department of Computer Science, Wellesley College.
AI Communications, ISSN 0921-7126, IOS Press. All rights reserved.

1. Motivation

All commercial antivirus (AV) products rely on signature matching, the bulk of which is strict byte sequence pattern matching. For modern, evolving polymorphic and metamorphic malware, this approach is unsatisfactory. Clementi recently checked fifteen state-of-the-art, updated AV scanners against ten highly polymorphic malware samples and found false negative rates ranging from 0% to 90%, with an average of 48% [12]. This development was already predicted in 2001 [54].

Polymorphic malware contains decryption routines which decrypt encrypted constant parts of the malware body. The malware can mutate its decryptors in subsequent generations, thereby complicating signature-based detection approaches. The decrypted body, however, remains constant. Metamorphic malware generally does not use encryption, but is able to mutate its body in subsequent generations using various techniques, such as junk insertion, semantic NOPs, code transposition, equivalent instruction substitution and register reassignments [11][52]. The net result of these techniques is a shrinking usable "constant base" for strict signature-based detection approaches.

Signature-based approaches are quite fast but show little tolerance for metamorphic and polymorphic code, while heuristics such as emulation are more resilient but quite slow and may hinge on environmental triggers. A detection approach that combines the best of both worlds would therefore be desirable. This is the philosophy behind a structural fingerprint. Structural fingerprints are statistical in nature, and as such are positioned as 'fuzzier' metrics between static signatures and dynamic heuristics. The structural fingerprint investigated in this paper for differentiation purposes is based on some properties of the executable's callgraph. I also propose a generative mechanism for the callgraph topology.

2. Generating the callgraph

Primary tools used are described in more detail in the appendix.

2.1. Samples

For non-malicious software, henceforth called 'goodware', sampling followed a two-step process: I inventoried all PEs (the primary 32-bit Windows file format) on a Microsoft XP Home SP2 laptop, drew 300 samples uniformly at random, and discarded overly large and small files, yielding 280 samples. For malicious software (malware), seven classes of interest were fixed: backdoor, hacking tools, DoS, trojans, exploits, virus, and worms. The worm class was further divided into Peer-to-Peer (P2P), Internet Relay Chat/Instant Messenger (IRC/IM), Email and Network worm subclasses. For a non-specialist introduction to malicious software, see [51]; for a canonical reference, see [53]. Each class (subclass) contained at least 15 samples.

Since AV vendors were hesitant for liability reasons to provide samples, I gathered them from herm1t's (underground) collection and identified compiler and (potential) packer metadata using PEiD. Practically all malware samples were identified as having been compiled by MS C++ 5.0/6.0, MS Visual Basic 5.0/6.0 or LCC, and about a dozen samples were packed with various versions of UPX (an executable compression program). Malware was run through best-of-breed, updated open- and closed-source AV products, yielding a false negative rate of 32% (open-source) and 2% (closed-source), respectively. Overall file sizes for both mal- and goodware ranged from $\Theta(10\,\text{kB})$ to $\Theta(1\,\text{MB})$. A preliminary file size distribution investigation yielded a log-normal distribution; for a putative explanation of the underlying generative process, see [38] and [31].

All 400 samples were loaded into the de-facto industry standard disassembler (IDA Pro [24]), inter- and intra-procedurally parsed and augmented with symbolic meta-information gleaned programmatically from the binary via FLIRT signatures (when applicable). I exported the identified structures via IDAPython into a MySQL database. These structures were subsequently parsed by a disassembly visualization tool (BinNavi [16]) to generate and investigate the callgraph.

2.2. Callgraph

Following [17], we treat an executable as a graph of graphs. This follows the intuition that in any procedural language, the source code is structured into functions (each of which can be viewed as a flowchart, i.e. a directed graph which we call a flowgraph). These functions call each other, thus creating a larger graph where each node is a function and the edges are calls-to relations between the functions. We call this larger graph the callgraph.

We recover this structure by disassembling the executable into individual instructions. We distinguish between short and far branch instructions: short branches do not save a return address, while far branches do. Intuitively, short branches are normally used to pass control around within one function of the program, while far branches are used to call other functions. A sequence of instructions that is continuous (i.e. has no branches jumping into its middle and ends at a branch instruction) is called a basic block. We consider the graph formed by having each basic block as a node and each short branch as an edge. The connected components in this directed graph correspond to the flowgraphs of the functions in the source code. For each connected component in the previous graph, we create a node in the callgraph. For each far branch in the connected component, we add an edge to the node corresponding to the connected component this branch is targeting. Fig. 1 illustrates these concepts.

[Fig. 1. Graph structures of an executable: (a) example callgraph, (b) example control flow graph (CFG), (c) example basic block.]

Formally, denote a callgraph $CG$ as $CG = G(V, E)$, where $G(\cdot)$ stands for 'Graph'. Let $V = \bigcup F$, where $F \in \{\text{normal}, \text{import}, \text{library}, \text{thunk}\}$. This just says that each function in $CG$ is either a 'library' function (from external libraries statically linked in), an 'import' function (dynamically imported from a dynamic library), a 'thunk' function (mostly one-line wrapper functions used for calling convention or type conversion) or a 'normal' function (which can be viewed as the executable's own function). The following metrics were programmatically collected from $CG$:
– $|V|$ is the number of nodes in $CG$, i.e. the function count of the callgraph.
– For any $f \in V$, let $f = G(V_f, E_f)$, where $b \in V_f$ is a block of code; i.e. each node in the callgraph is itself a graph, a flowgraph, and each node of the flowgraph is a basic block.
– Define $IC : B \to \mathbb{N}$, where $B$ is the set of blocks of code and $IC(b)$ is the number of instructions in $b$. We denote this function in shorthand as $|b|_{IC}$, the number of instructions in basic block $b$.
– We extend the notation $|\cdot|_{IC}$ to elements of $V$ by defining $|f|_{IC} = \sum_{b \in V_f} |b|_{IC}$. This gives us the total number of instructions in a node of the callgraph, i.e. in a function.
– Let $d^+_G(f)$, $d^-_G(f)$ and $d^{bb}_G(f)$ denote the indegree, outdegree and basic block count of a function, respectively.

Table 1. Correlation, IQR for instruction count

class      metric   $\Theta(10)$   $\Theta(100)$   $\Theta(1000)$
Goodware   r        0.05           -0.017          -0.0366
           IQR      12             44              36
Malware    r        0.08           0.0025          0.0317
           IQR      8              45              28

2.3. Correlations

I calculated the correlation between the indegree and outdegree of functions. Prior analyses of static class collaboration networks [45][41] suggest an anti-correlation, characterizing some functions as sources or sinks. I found no significant correlation between the indegree and outdegree of functions in the disassembled executables (Fig. 2). Intuitively, correlation is unlikely to occur except in the '0 outdegree' case: the BinNavi toolset does not generate the flowgraph for imported functions, so an imported function automatically has outdegree 0 but will be called from many other functions.
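The indegree/outdegree comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's tooling (which relied on IDA Pro, BinNavi and a MySQL database): the callgraph is assumed to be available as a list of (caller, callee) pairs, every call site is assumed to count as one edge, and the names `degrees`, `pearson_r` and the toy function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def degrees(functions, call_edges):
    """Map each function to (indegree, outdegree).

    call_edges is an iterable of (caller, callee) pairs. Assumption: each
    pair counts as one edge, so repeated calls would be counted repeatedly.
    """
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for caller, callee in call_edges:
        outdeg[caller] += 1
        indeg[callee] += 1
    return {f: (indeg[f], outdeg[f]) for f in functions}


def pearson_r(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


# Toy callgraph (hypothetical): CreateFileA plays the role of an imported
# function, which has outdegree 0 but is called from several functions.
functions = ["main", "parse", "log", "CreateFileA"]
call_edges = [
    ("main", "parse"),
    ("main", "log"),
    ("parse", "log"),
    ("parse", "CreateFileA"),
    ("log", "CreateFileA"),
]

deg = degrees(functions, call_edges)
r = pearson_r([d[0] for d in deg.values()], [d[1] for d in deg.values()])
print(deg)
print(r)
```

On a real executable the same computation would be run over all functions of one callgraph, giving one coefficient per executable; a significance test for the coefficient would have to be added separately.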
Additionally, I size-blocked both sample groups into three function count blocks, with block criteria chosen as $\Theta(10)$, $\Theta(100)$ and $\Theta(1000)$ function counts, to investigate a correlation between the instruction count in functions and the complexity of the executable (with function count as a proxy). Again, I found no correlation at significance level $\le 0.001$. Coefficient values and the IQR for instruction counts (a spread measure, the difference between the 75th and the 25th percentiles of the sample) are given in Table 1. The first result corroborates previous findings; the second result agrees, at the phenomenological level, with the 'refactoring' model in [41], which posits that excessively long functions tend to be decomposed into smaller functions. Remarkably, the spread is quite low, on the order of a few dozen instructions. I will discuss models further in section 4.

[Fig. 2. Correlation coefficient $r_{in,out}$, plotted as p-value (log scale) against r. (a) Malware: no statistically significant correlation between indegree and outdegree of callgraph nodes for ≈ 85% of executables; 15% of executables with p-values ≤ 0.05 and 8% with p-values ≤ 0.01 show weak to very weak correlation. (b) Goodware: no statistically significant correlation for ≈ 94% of executables; 6% of executables with p-values ≤ 0.05 and 3% with p-values ≤ 0.01 show weak to very weak correlation.]

2.4. Function types

Each point in the scatterplots in Fig. 3 represents three metrics for one individual executable: function count, and the proportions of normal functions, static library + dynamic import functions, and thunks. Proportions for an individual executable add up to 1. The four subgraphs are parsed thusly, using Fig. 3(b) as an example.
The x-axis denotes the proportion of 'normal' functions and the y-axis the proportion of 'thunk' functions in the binaries. The color of each point indicates $|V|$, which may serve as a rough proxy for the executable's size. The dark red point at $(X, Y) = (0.87, 0.007)$ is endnote.exe, since it is the only goodware binary with a function count of $\Theta(10^4)$.

[Fig. 3. Scatterplot of function type proportions; point color encodes function count. Panels: (a) GW: Norm vs Lib+Imp, (b) GW: Norm vs Thunk, (c) GW: Thunk vs Lib+Imp, (d) MW: Norm vs Lib+Imp, (e) MW: Norm vs Thunk, (f) MW: Thunk vs Lib+Imp.]

Most thunks are wrappers around imports; hence in small executables, a larger proportion of functions will be thunks. The same holds for libraries: the larger the executable, the smaller the percentage of libraries. This is heavily influenced by the choice of dynamic vs. static linking. The thunk/library plot, listed for completeness, does not give much information, confirming the intuition that the two are independent of each other, mostly due to compiler behavior.

2.5. α fitting with Hill estimator

Taking our cue from [44], who surveyed empirical studies of technological, social, and biological networks, I hypothesize that the discrete distributions of $d^+(f)$, $d^-(f)$ and $d^{bb}(f)$ follow a truncated power law of the form $P_{d(f)}(m) \sim m^{\alpha_{d(f)}} e^{-m/k_c}$, where $k_c$ indicates the end of the power law regime. As shorthand, I call $\alpha_{d(f)}$ for the respective metrics $\alpha_{indeg}$, $\alpha_{outdeg}$ and $\alpha_{bb}$.

Figs. 4(a) and 4(b) show pars pro toto the fitting procedures for our 400 samples. The plot is an Empirical Complementary Cumulative Density Plot (ECCDF). The x-axis shows the indegree; the y-axis shows the complementary CDF $P[X > x]$, i.e. the probability that a function in endnote.exe has an indegree greater than $x$. If $P[X > x]$ can be shown to fit a Pareto distribution, we can extract the power law exponent for the PMF $P_{d(f)}(m)$ from the CDF fit (see [1] and, more extensively, [42] for the relationship between Pareto, power law and Zipf distributions).

Parsing Fig. 4(a): blue points denote the data points (functions) and two descriptive statistics (the median and the maximum value) for the indegree distribution of endnote.exe. We see that for endnote.exe, 80% of functions have an indegree of 1, 2% an indegree > 10, and roughly 1% an indegree > 20. The fitted distribution is shown in magenta, together with the parameters $\alpha = 1.97$ and $k_c = 1415.83$. Although tempting, simply 'eyeballing' Pareto CDFs for the requisite linearity on a log-log scale [23][6] is not enough.
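To make the ECCDF reading above concrete, the following minimal sketch computes the empirical $P[X > x]$ from a list of per-function degrees and evaluates a truncated power law tail next to it. The names `eccdf` and `tail_model` are hypothetical, the sample degrees are made up, and the tail is assumed to have the form $C x^{-\alpha} e^{-x/k_c}$ as printed in the legend of Fig. 4; the constant $C = 1$ is an arbitrary choice for illustration, not a fitted value.

```python
from collections import Counter
from math import exp


def eccdf(values):
    """Empirical complementary CDF.

    Returns a list of (x, P[X > x]) pairs, one per distinct value x,
    in increasing order of x.
    """
    n = len(values)
    counts = Counter(values)
    remaining = n
    points = []
    for x in sorted(counts):
        remaining -= counts[x]  # observations strictly greater than x
        points.append((x, remaining / n))
    return points


def tail_model(x, c, alpha, kc):
    """Truncated power law tail C * x**(-alpha) * exp(-x / kc).

    Assumption: the sign convention follows the Fig. 4 legend.
    """
    return c * x ** (-alpha) * exp(-x / kc)


# Made-up indegrees for eight functions (illustrative only).
indegrees = [1, 1, 1, 1, 2, 3, 5, 12]
points = eccdf(indegrees)

# Compare against the model using the parameters reported for endnote.exe.
for x, p in points:
    print(x, p, tail_model(x, 1.0, 1.97, 1415.83))
```

Plotting `points` on log-log axes gives the kind of curve shown in Fig. 4; a roughly straight segment followed by a drop-off near $k_c$ is what the truncated power law predicts.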
Rather than rely on visual inspection, I follow [38] on philosophy and [47] on methodology and calculate the Hill estimator $\hat{\alpha}$, whose asymptotic normality is then used to compute a 95% CI. This is shown in the inset and serves as a Pareto model self-consistency check that estimates the parameter $\alpha$ as a function of the number of observations. As the number of observations $i$ increases, a model that is consistent along the data should show roughly $CI_i \supseteq CI_{i+1}$. For an insightful exposé and more recent procedures to estimate Pareto tails, see [60][19].

[Fig. 4. Pareto fitting ECCDFs, shown with Hill estimator inset ($\hat{\alpha}(n)$ against the number of observations); fitted form $C x^{-k} e^{-x/k_c}$. (a) GW sample: fitting for $\alpha_{indeg}$ and $k_c$; endnote.exe (G81), numfunc = 10339, α = 1.9716, $k_c$ = 1415.8378, median = 0, datamax = 1497. (b) MW sample: fitting for $\alpha_{bb}$ and $k_c$; Exploit.Win32.MsSqlHack (G48), numfunc = 170, α = 1.6073, $k_c$ = 50.9956, median = 3.5, datamax = 184.]

To tentatively corroborate the consistency of our posited Pareto model, 30 (goodware) and 21 (malware) indegree, outdegree and basic block ECCDF plots were uniformly sampled into three function count blocks, with block criteria chosen as $\Theta(10)$, $\Theta(100)$ and $\Theta(1000)$ function counts, yielding a sampling coverage of 10% (goodware) and 17% (malware). Visual inspection indicates that for malware, the model seemed more consistent for outdegree than for indegree at all function sizes. For basic block count, the consistency tends to be better for smaller executables. I see these tendencies for goodware as well, with the observation that outdegree was most consistent in size block $\Theta(100)$; for $\Theta(10)$ and $\Theta(1000)$. For both malware and goodware, indegree seemed the least consistent: quite a few samples exhibited a so-called 'Hill Horror Plot' [47], where the $\hat{\alpha}$ values and the corresponding CIs were very jittery.

The fitted power-law exponents $\alpha_{indeg}$, $\alpha_{outdeg}$ and $\alpha_{bb}$, together with individual functions' callgraph size, are shown in Fig. 5. For both classes, the range extends for $\alpha_{indeg} \approx [1.5, 3]$, $\alpha_{outdeg} \approx [1.1, 2.5]$ and $\alpha_{bb} \approx [1.1, 2.1]$, with a slightly greater spread for malware.

2.6. Testing for difference

I now check whether there are any statistically significant differences between the $(\alpha, k_c)$ fits for goodware and malware, respectively. Following procedures in [61], I find $\alpha_{indeg}$, $\alpha_{outdeg}$ and $\alpha_{bb}$ to be approximately normally distributed. The exponential cutoff parameters $k_c$ are lognormally distributed. Applying a standard two-tailed t-test (Table 2), I find at significance level 0.05 ($t_{critical} = 1.97$) only $\mu(\alpha_{bb,malware}) \ge \mu(\alpha_{bb,goodware})$.

Table 2. α distribution fitting and testing

class   Basic Block      Indegree        Outdegree
GW      N(1.634, 0.3)    N(2.02, 0.3)    N(1.69, 0.307)
MW      N(1.7, 0.3)      N(2.08, 0.45)   N(1.68, 0.35)
t       2.57             1.04            -0.47

For the basic blocks, $k_c \approx LogN(59.1, 52)$ (goodware) and $k_c \approx LogN(54.2, 44)$ (malware), and the hypothesis $\mu(k_c(bb, malware)) = \mu(k_c(bb, goodware))$ was rejected via a Wilcoxon Rank Sum test with $z = 13.4$. The steeper slope of malware's $\alpha_{bb}$ implies that functions in malware tend to have a lower basic block count. This can be accounted for by the fact that malware tends to be simpler than most applications and operates without much interaction, hence fewer branches, hence fewer basic blocks. Malware tends to have limited functionality and operates independently of input from the user and the operating environment. Also, malware is usually not compiled with aggressive compiler optimization settings en-
