MeCC: Memory Comparison-based Code Clone Detector
Upcoming SlideShare
Loading in...5
×
 

MeCC: Memory Comparison-based Code Clone Detector

on

  • 1,072 views

A research presentation in ICSE 2011.

A research presentation in ICSE 2011.

Statistics

Views

Total Views
1,072
Views on SlideShare
1,072
Embed Views
0

Actions

Likes
0
Downloads
13
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

MeCC: Memory Comparison-based Code Clone Detector MeCC: Memory Comparison-based Code Clone Detector Presentation Transcript

  • MeCC: Memory Comparison- based Clone DetectorHeejung Kim1,Yungbum Jung1, Sunghun Kim2, and Kwangkeun Yi1 Seoul National University 12 The Hong Kong University of Science and Technology http://ropas.snu.ac.kr/mecc/ 1
  • Code Clones • similar code fragments (syntactically or semantically)static PyObject * static PyObject *float_add(PyObject *v, PyObject *w) float_mul(PyObject *v, PyObject *w){ { double a,b; double a,b; CONVERT_TO_DOUBLE(v,a); CONVERT_TO_DOUBLE(v,a); CONVERT_TO_DOUBLE(w,b); CONVERT_TO_DOUBLE(w,b); PyFPE_START_PROTECT(“add”,return 0) PyFPE_START_PROTECT(“multiply”,return 0) a = a + b; a = a * b; PyFPE_END_PROTECT(a) PyFPE_END_PROTECT(a) return PyFloat_FromDouble(a); return PyFloat_FromDouble(a);} } 2
  • Applications of Code Clones• software refactoring• detecting potential bugs• understanding software evolution• detecting software plagiarism (malicious duplication) 3
  • Clone Detectors• CCFinder [TSE’02] textual tokens• DECKARD [ICSE’07] AST characteristic vectors• PDG-based [ICSE‘08, SAS’01] program dependence graph Effective for syntactic code clones limited for semantic code clones 4
  • Three code clonesmissed by syntax-based clone detection 5
  • #1 Control ReplacementPyObject *PyBool_FromLong (long ok) static PyObject *get_pybool (int istrue){ { PyObject *result; PyObject *result = if (ok) result = Py_True; istrue? Py_True: Py_False; else result = Py_False; Py_INCREF(result); Py_INCREF(result); return result; return result;} } syntactically different but semantically identical 6
  • #2 Capturing Procedural Effectsvoid appendPQExpBufferChar (PQExpBuffer str, char ch) { /* Make more room if needed *. if (!enlargePQExpBuffer(str, 1)) return; /* OK, append the data */ str->data[str->len] = ch; str->len++; str->data[str->len] = ‘0’;}void appendBinaryPQExpBuffer (PQExpBuffer str, const char* data, size_t datalen) { /* Make more room if needed *. if (!enlargePQExpBuffer(str, datalen)) return; /* OK, append the data */ memcpy(str->data + str->len, data, datalen); understanding memory str->len+= datalen; str->data[str->len] = ‘0’; behavior of procedures} 7
  • ... *set_access_name(cmd_parms *cmd, void *dummy, const char *arg){ void *sconf = cmd->server->module_config; core_server_config *conf = ap_get_module_config(sconf, &core_module); const char *err = ap_check_cmd_context(sconf,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); if (err != NULL) { return err; } conf->access_name = apr_pstrdup(cmd->pool,arg); return NULL;} #3 More Complex Clone... *set_protocol(cmd_parms *cmd, void *dummy, const char *arg){ const char *err = ap_check_cmd_context(cmd,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); core_server_config *conf = ap_get_module_config(cmd->server->module_config, &core_module); char *proto; if (err != NULL) { return err; } proto = apr_pstrdup(cmd->pool,arg); ap_str_tolower(proto); conf->protocol = proto; return NULL; 8}
  • ... *set_access_name(cmd_parms *cmd, void *dummy, const char *arg){ void *sconf = cmd->server->module_config; core_server_config *conf = ap_get_module_config(sconf, &core_module); const char *err = ap_check_cmd_context(sconf,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); if (err != NULL) { return err; } conf->access_name = apr_pstrdup(cmd->pool,arg); return NULL;} statement reordering... *set_protocol(cmd_parms *cmd, void *dummy, const char *arg){ const char *err = ap_check_cmd_context(cmd,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); core_server_config *conf = ap_get_module_config(cmd->server->module_config, &core_module); ￿ char *proto; if (err != NULL) { return err; } proto = apr_pstrdup(cmd->pool,arg); ap_str_tolower(proto); conf->protocol = proto; return NULL; 9}
  • ... *set_access_name(cmd_parms *cmd, void *dummy, const char *arg){ void *sconf = cmd->server->module_config; core_server_config *conf = ap_get_module_config(sconf, &core_module); const char *err = ap_check_cmd_context(sconf,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); if (err != NULL) { return err; } conf->access_name = apr_pstrdup(cmd->pool,arg); return NULL;} statement intermediate reordering variables... *set_protocol(cmd_parms *cmd, void *dummy, const char *arg){ const char *err = ap_check_cmd_context(cmd,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); core_server_config *conf = ap_get_module_config(cmd->server->module_config, &core_module); char *proto; if (err != NULL) { return err; } proto = apr_pstrdup(cmd->pool,arg); ap_str_tolower(proto); conf->protocol = proto; return NULL; 10}
  • ... *set_access_name(cmd_parms *cmd, void *dummy, const char *arg){ void *sconf = cmd->server->module_config; core_server_config *conf = ap_get_module_config(sconf, &core_module); const char *err = ap_check_cmd_context(sconf,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); if (err != NULL) { return err; } conf->access_name = apr_pstrdup(cmd->pool,arg); return NULL;} statement intermediate statement reordering variables splitting... *set_protocol(cmd_parms *cmd, void *dummy, const char *arg){ const char *err = ap_check_cmd_context(cmd,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); core_server_config *conf = ap_get_module_config(cmd->server->module_config, &core_module); char *proto; if (err != NULL) { return err; } proto = apr_pstrdup(cmd->pool,arg); ap_str_tolower(proto); conf->protocol = proto; return NULL; 11}
  • ... *set_access_name(cmd_parms *cmd, void *dummy, const char *arg){ void *sconf = cmd->server->module_config; core_server_config *conf = ap_get_module_config(sconf, &core_module); const char *err = ap_check_cmd_context(sconf,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); if (err != NULL) { return err; } conf->access_name = apr_pstrdup(cmd->pool,arg); return NULL;} statement intermediate statement reordering variables splitting... *set_protocol(cmd_parms *cmd, void *dummy, const char *arg){ const char *err = ap_check_cmd_context(cmd,NOT_IN_DIR_LOC_FILE | NOT_IN_LIMIT); core_server_config *conf = ap_get_module_config(cmd->server->module_config, &core_module); char *proto; if (err != NULL) { return err; } proto = apr_pstrdup(cmd->pool,arg); ap_str_tolower(proto); conf->protocol = proto; return NULL; 12}
  • These Semantic Clonesare Identified by MeCC 13
  • MeCC: Our Approach• Static analyzer estimates the semantics of programs• Abstract memories are results of analysis• Comparing abstract memories is a measure 14
  • Clone Detection Process procedures P ￿ P1 P2 P3 P4program 15
  • Clone Detection Process procedures P ￿ abstract P1 P2 memories P3 P4 Static ￿ ￿ F (P ) = Mprogram Analyzer
  • Clone Detection Process procedures P ￿ abstract P1 P2 memories P3 P4 Static ￿ ￿ F (P ) = Mprogram Analyzer Comparing Memories S(M, M￿ ) similarities 17
  • Clone Detection Process procedures P ￿ abstract P1 P2 memories P3 P4 Static ￿ ￿ F (P ) = Mprogram Analyzer Comparing Memories Code Clones Grouping P1 P2 S(M, M￿ ) P3 P4 similarities 18
  • Clone Detection Process procedures P ￿ abstract P1 P2 memories P3 P4 Static ￿ ￿ F (P ) = Mprogram Analyzer Comparing Memories Code Clones Grouping P1 P2 S(M, M￿ ) P3 P4 similarities 19
  • Estimating Semantics by log MinEntry Abstract Memories S(M1 , M2 ) log(| M1 | + | M2 |) 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82int make (list *a, int count){ int r = count + 1; Address Values if (a!=0){ a ￿→ {(true, α)} a->next = malloc(...); count ￿→ {(true, β)} a->next->val = count; r ￿→ {(true, β + 1)} } else { α.next ￿→ {(α ￿= 0, ￿)} return r - 1; ￿.val ￿→ {(α ￿= 0, β)} } RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} return r;} a ￿→ {(true, α)} b ￿→ {(true, β)}• Estimating an abstract memory at the α.n ￿.v ￿→ ￿→ {(α ￿= 0, ￿)} {(α ￿= 0, β)} procedure’s exit point RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)} {}, {} ￿ P ⇓ v, M• Abstract memory is a map from abstract {}, {} ￿ P : τ addresses to abstractlist next} type list = {int x, values 20 let list node = {x:=1, next:={}}
  • Estimating Semantics by log MinEntry Abstract Memories S(M1 , M2 ) log(| M1 | + | M2 |) 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82int make (list *a, int count){ int r = count + 1; Address Values if (a!=0){ a ￿→ {(true, α)} a->next = malloc(...); count ￿→ {(true, β)} a->next->val = count; r ￿→ {(true, β + 1)} } else { α.next ￿→ {(α ￿= 0, ￿)} return r - 1; ￿.val ￿→ {(α ￿= 0, β)} } RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} return r;} a ￿→ {(true, α)} b ￿→ {(true, β)}• Estimating an abstract memory at the α.n ￿.v ￿→ ￿→ {(α ￿= 0, ￿)} {(α ￿= 0, β)} procedure’s exit point RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)} {}, {} ￿ P ⇓ v, M• Abstract memory is a map from abstract {}, {} ￿ P : τ addresses to abstractlist next} type list = {int x, values 21 let list node = {x:=1, next:={}}
  • Estimating Semantics by log MinEntry Abstract Memories S(M1 , M2 ) log(| M1 | + | M2 |) 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82 int make (list *a, int count){ int r = count + 1; Address Values if (a!=0){ a ￿→ {(true, α)} a->next = malloc(...); count ￿→ {(true, β)} a->next->val = count; r ￿→ {(true, β + 1)} } else { α.next ￿→ {(α ￿= 0, ￿)} return r - 1; ￿.val ￿→ {(α ￿= 0, β)} } RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} return r; } a ￿→ {(true, α)} b ￿→ {(true, β)}• Use symbols for unknown input values α.n ￿→ {(α ￿= 0, ￿)} ￿.v ￿→ {(α ￿= 0, β)} RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)}• All abstract values are guarded by execution {}, {} ￿ P ⇓ v, M path conditions {}, {} ￿ P : τ type list = {int x, list next} 22 let list node = {x:=1, next:={}}
  • Estimating Semantics by log MinEntry Abstract Memories S(M1 , M2 ) log(| M1 | + | M2 |) 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82 int make (list *a, int count){ int r = count + 1; Address Values if (a!=0){ a ￿→ {(true, α)} a->next = malloc(...); count ￿→ {(true, β)} a->next->val = count; r ￿→ {(true, β + 1)} } else { α.next ￿→ {(α ￿= 0, ￿)} return r - 1; ￿.val ￿→ {(α ￿= 0, β)} } RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} return r; } a ￿→ {(true, α)} b ￿→ {(true, β)}• Use symbols for unknown input values α.n ￿→ {(α ￿= 0, ￿)} ￿.v ￿→ {(α ￿= 0, β)} RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)}• All abstract values are guarded by execution {}, {} ￿ P ⇓ v, M path conditions {}, {} ￿ P : τ type list = {int x, list next} 23 let list node = {x:=1, next:={}}
  • Estimating Semantics by log MinEntry Abstract Memories S(M1 , M2 ) log(| M1 | + | M2 |) 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82int make (list *a, int count){ int r = count + 1; Address Values if (a!=0){ a ￿→ {(true, α)} a->next = malloc(...); count ￿→ {(true, β)} a->next->val = count; r ￿→ {(true, β + 1)} } else { α.next ￿→ {(α ￿= 0, ￿)} return r - 1; ￿.val ￿→ {(α ￿= 0, β)} } RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} return r;} a ￿→ {(true, α)} b ￿→ {(true, β)}copy and modify α.n ￿.v ￿→ ￿→ {(α ￿= 0, ￿)} {(α ￿= 0, β)} RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)}int make2 (list2 *a, int b){ if (a==0) return b; {}, {} ￿ P ⇓ v, M a->n = malloc(...); a->n->v = b; return b + 2; {}, {} ￿ P : τ} type list = {int x, list next} 24 let list node = {x:=1, next:={}}
  • Estimating Semantics by log MinEntry Abstract Memories S(M1 , M2 ) log(| M1 | + | M2 |) 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82int make (list *a, int count){ int r = count + 1; Address Values log MinEntry if (a!=0){ a S(M￿→M ) log(| M1 {(true, α)} , 1 2 | + | M2 |) a->next = malloc(...); count ￿→ {(true, β)} a->next->val = count; r ￿→ {(true, β + 1)} } else { 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82 α.next ￿→ {(α ￿= 0, ￿)} return r - 1; ￿.val ￿→ {(α ￿= 0, β)} } a RETV ￿→ ￿→ {(α = 0, {(true, α)}(α ￿= 0, β + 1)} β + 1 − 1), return r; count ￿→ {(true, β)}} r a￿→ ￿→ {(true, 1)} {(true, β +α)} α.next b￿→ ￿→ {(true, β)} {(α ￿= 0, ￿)}copy and modify ￿.val α.n ￿→￿→ {(α ￿= ￿= 0, ￿)} {(α 0, β)} RETV Address ￿→ = 0, β + Values(α ￿= 0, β + 1)} ￿.v {(α ￿→ {(α ￿= 1 − 1), 0, β)} RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)}int make2 (list2 *a, int b){ a ￿→ {(true, α)} if (a==0) return b; b ￿→ {}, {} {(true, β)} ￿ P ⇓ v, M a->n = malloc(...); α.n ￿→ {(α ￿= 0, ￿)} a->n->v = b; ￿→ return b + 2; ￿.v {}, {(α ￿= 0, τ {} ￿ P : β)} RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)}} type list = {int x, list next} {}, {} ￿ P ⇓ v, M 25 let list node = {x:=1, next:={}}
  • Clone Detection Process procedures P ￿ abstract P1 P2 memories P3 P4 Static ￿ ￿ F (P ) = Mprogram Analyzer Comparing Memories Code Clones Grouping P1 P2 S(M, M￿ ) P3 P4 similarities 26
  • a ￿→ {(true, α)} log MinEntry count ￿→ {(true, β)} S(M1 , M2 ) log(| M1 | + | M2 |) r ￿→ {(true, β + 1)} Comparing Abstract Memories 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82 α.next ￿.val ￿→ ￿→ {(α ￿= 0, ￿)} {(α ￿= 0, β)} RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} a ￿→ {(true, α)} a ￿→ {(true, α)} count ￿→ {(true, β)} b ￿→ {(true, β)} r ￿→ {(true, β + 1)} α.n ￿→ {(α ￿= 0, ￿)} α.next ￿→ {(α ￿= 0, ￿)} ￿.v ￿→ {(α ￿= 0, β)} ￿.val ￿→ {(α ￿= 0, β)} RETV a {(tru ￿→ {(α = 0, β), (α ￿= 0, β + 2)} RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} count {(tru {}, {} ￿ P ⇓ v, M aa ￿→ {(true, α)} {(true, α)} r {(true, b ￿→ {(true, β)} α.next {(α ￿= count α.n ￿→ {(α ￿= 0, ￿)} {(true,log MinEntry β)} {}, {} ￿ P : τ α.val {(α ￿= r ￿→ {(α ￿= 0, β)} M2 ) log(| M1 | + | M2 |) {(true, β + 1)} 1. Classifying addresses into similar classes ￿.v α.next S(M1 , type list = {int x, list next} MinEntry RETV a log RETV ￿→ {(α = 0, β), (α ￿= 0, {(α2)} 0, ￿)} log MinEntry β + ￿= {(true, α)} {(α ￿= 0, β + 1 − 1 S(M , M log(| M1 | + | M2 |) a {(true α.val {}, {} ￿ 2(2letM S(M1.0 +21)1·log(| )M+ |5) = M2 |) {(true, β)} {(α ,￿= {x:=1,2 next:={}} 0.82 list node =0, β)} count + | ·v, local return parameters P ⇓in1.0 + 2 · 1 M field addresses {(true, β + 1)} {(true 0.5)/(6 1 r 0, β + 1)} {(true, α)} b RETV {(α ￿= 0, βa 1 − 1), (α = + a node.next.x 2(2{(true, α)} α.next address variables· 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) ={(αα.n0, ￿)} ￿= 0.82β)} {(α ￿= 0 {}, {} ￿ P : τ count ￿.val 1.0 + 1α.n ￿.v α.val {(true,￿α.v β)} +{(true, in0.5)/(6 + 5) = 0.82 = 0, 2(2x· 1.0 {a:=1,α)} β)}E 2 · b:=2} · {(α {(α ￿= 0 count a let {(true, :== {int x, list next} a b a r RETV {(true, ￿.val α)} {(true, β)}βlist1)} ￿.v {(true, βRETV (α{(α ￿= 0, β), (α {(α ￿= 0, β + 1 −+ 1)} 0, β + 1)} 1), = r type list{(true, {(true, α)} ￿→ α.next x, = {int + next} {(α {(α ￿)}0, ￿.vprev} 0, x, tsil α.n ￿.val β)} α.nextcount type tsil =￿={(true,￿)} β)}ode = {x:=1, next:={}} countα.n {int ￿= {(true, a {(α {(true,￿)} {}, {} ￿ P ⇓ v ￿= 0, α)} ￿→ 27 ￿→ ￿= 0, β b {(true, α)} ￿= 0, β)} α.val {(true, a{(α ￿= 0, β)} β)} + 1)} {(α {(true, β)}xt.x r α.v α.val r let￿→ {(true, β + {(α ... {x:=1, next:={}} 1)}
  • a ￿→ {(true, α)} log MinEntry count ￿→ {(true, β)} S(M1 , M2 ) log(| M1 | + | M2 |) r ￿→ {(true, β + 1)} Comparing Abstract Memories 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82 α.next ￿.val ￿→ ￿→ {(α ￿= 0, ￿)} {(α ￿= 0, β)} RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} a ￿→ {(true, α)} a ￿→ {(true, α)} count ￿→ {(true, β)} b ￿→ {(true, β)} r ￿→ {(true, β + 1)} α.n ￿→ {(α ￿= 0, ￿)} α.next ￿→ {(α ￿= 0, ￿)} ￿.v ￿→ {(α ￿= 0, β)} ￿.val ￿→ {(α ￿= 0, β)} RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)} RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} a {(true, α)} {}, {} ￿ P ⇓ v, M counta ￿→ {(true, α)} β)} {(true, b ￿→ {(true, β)} r {(true, ￿)}+ 1)} {(α ￿= 0, β {}, {} ￿ P : τ α.n ￿→ α.next ￿.v 2. Compareβ)} ￿)} ￿→ {(α ￿= guarded values in the same {(α ￿= 0, 0, type list = {int x, list next} α.val similar classes (score 0.0 to 1.0) RETV ￿→ {(α = 0, {(α ￿= 0, β + 2)} β), (α ￿= 0, β)} RETV {(α {}, {} β P ⇓letM1), (α ￿= 0, β + 1)} − = 0, ￿ + 1v, list node = {x:=1, next:={}} a in {(true, α)} count {(true, α)} α)} a {(true, {(true, β)} {}, {} ￿ P : τ node.next.x score 1.0 t r b {(true, β)} x{(true, β b:=2} in E {(true, β)} let := {a:=1, + 1)} α.next = 0, β+ ￿= − ￿)}(α￿= 0, ￿)} 1)} = {int x, listα.n next} {(α 1 0, {(α {(true, β +1)} 1), ￿= 0, β + {(α α.val{(α ￿= {(α ￿= 0, β)}= ￿= 0, β)} α.v 0, ￿)} ￿={(αβ + 2)} type list score {int x, list next}ext = {x:=1, next:={}}{(α = 0, β), (α tsil = {int x, tsil prev} ode 0.5 RETV RETV {(α =typeβ +0, − 1), (α ￿= 0, β + 1)} 0, 1 28al xt.x {(α ￿= 0, β)} let ... {x:=1, next:={}}
  • a ￿→ {(true, α)} log MinEntry count ￿→ {(true, β)} S(M1 , M2 ) log(| M1 | + | M2 |) r ￿→ {(true, β + 1)} Comparing Abstract Memories 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82 α.next ￿.val ￿→ ￿→ {(α ￿= 0, ￿)} {(α ￿= 0, β)} RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} a ￿→ {(true, α)} a ￿→ {(true, α)} count ￿→ {(true, β)} b ￿→ {(true, β)} r ￿→ {(true, β + 1)} α.n ￿→ {(α ￿= 0, ￿)} α.next ￿→ {(α ￿= 0, ￿)} ￿.v ￿→ {(α ￿= 0, β)} ￿.val ￿→ {(α ￿= 0, β)} {(true, α)}RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)} RETV ￿→ {(α = 0, β + 1 − 1), (α ￿= 0, β + 1)} {}, {} ￿ P ⇓ v, M a ￿→ {(true, α)} ￿→(4 × 1.0 + 1 β)} 0.0 + 4 × 1.0 + 2 × {(true, × 0.5) 3. Find the best combination that maximizes the b α.n ￿→ {}, = ￿ P : τ {} 0.82 {(α ￿= 0, ￿)} 6 + 5 total score ￿.v ￿→ {(α ￿= 0, β)} type list = {int x, list next} RETV ￿→ {(α = 0, β), (α ￿= 0, β + 2)} maximum score {}, {} ￿ P ⇓ v, M1 , M2 ) = S(Mlist node = {x:=1, next:={}} let in {(true, α)} 1 | + | M2 | |M node.next.x {}, {} ￿ P : τ | {a:=1, − F(c￿ )E| let x := F(c) b:=2} in (4 × 1.0 + 1 × 0.0 + 4 × 1.0 + 2 × 0.5)= {int x, list next} type list = {int x, list next} = 0.82 ≥ 0.8ode = {x:=1, next:={}} type6tsil 5 {int x, tsil prev} + = 29 < 10xt.x let ... {x:=1, next:={}}
  • Experimental Results 30
  • Subject Projects Projects KLOC Procedures Application Python 435 7,657 interpreter Apache 343 9,483 web serverPostgreSQL 937 10,469 database 31
  • Detected Clones Total 623 6% 2% code clones 39% 53% Type-1 Type-2 Type-3 Type-4C. K. Roy and J. R. Cordy. A survey on software clone detection research. SCHOOL OF COMPUTING TR 2007-541, QUEENʼS UNIVERSITY, 115, 2007.
  • Semantic Clones45% Total 623 6% 2% code clones 39% 53% Type-1 Type-2 Type-3 Type-4
  • Comparison CCfinder CCfinder textual tokensPDG-basedDECKARD PDG-based MeCC program 0 75 150 225 300 dependency graphs CCfinder DECKARDPDG-based characteristic vectorsDECKARD MeCC Type-3 Type-4 0 10 20 30 40 34
  • Applications of Code Clones• software refactoring• detecting potential bugs• understanding software evolution• detecting software plagiarism (malicious duplication) 35
  • Finding Potential Bugs• A large portion of semantic clones are due to inconsistent changes• Inconsistent changes may lead to potential bugs (inconsistent clones) Two semantic clones with potential bugs 36
  • #1 Missed Null Checkconst char *GetVariable (VariableSpace space, const char *name){ struct_variable *current; if (!space) parameter name also should be checked! return NULL; for (current=space->next;current;current=current->next) { if (strcmp(current->name,name) == 0) { return current->value; } } return NULL;}const char *PQparameterStatus (const PGconn *conn, const char *paramName){ const pgParameterStatus *pstatus; if (!conn || !paramName) return NULL; for (pstatus=conn->pstatus; pstatus!=NULL; pstatus = pstatus->next) { if (strcmp(pstatus->name,paramName)== 0) return pstatus->value; } return NULL;} 37
  • #2 A Resource Leak BugPyObject *pwd_getpwall (PyObject *self){ PyObject *d; struct passwd *p; if ((d = PyList_New(0)) == NULL) return NULL; setpwent(); open user database while ((p = getpwent()) != NULL) { PyObject *v = mkpwent(p); if (v==NULL || PyList_Append(d,v)!=0) { Py_XDECREF(v); Py_DECREF(d); return NULL; A resource leak without } Py_DECREF(v); endpwent() procedure call } endpwent(); close user database return d;} Python project revision #20157 38
  • A Bug-free Procedure PyObject *spwd_getspall (PyObject *self,PyObject *pwd_getpwall (PyObject *self) PyObject *args){ { PyObject *d; PyObject *d; struct passwd *p; struct spwd *p; if ((d = PyList_New(0)) == NULL) if ((d = PyList_New(0)) == NULL) return NULL; return NULL; setpwent(); setspent(); while ((p = getpwent()) != NULL) { while ((p = getspent()) != NULL) { PyObject *v = mkpwent(p); PyObject *v = mkspent(p); if (v==NULL || PyList_Append(d,v)!=0) { if (v==NULL || PyList_Append(d,v)!=0) { Py_XDECREF(v); Py_XDECREF(v); Py_DECREF(d); Py_DECREF(d); endspent(); return NULL; return NULL; } } Py_DECREF(v); Py_DECREF(v); } } endpwent(); endspent(); return d; return d;} } Python project revision #38359 39
  • The Bug is Fixed Later PyObject *spwd_getspall (PyObject *self,PyObject *pwd_getpwall (PyObject *self) PyObject *args){ { PyObject *d; PyObject *d; struct passwd *p; struct spwd *p; if ((d = PyList_New(0)) == NULL) if ((d = PyList_New(0)) == NULL) return NULL; return NULL; setpwent(); setspent(); while ((p = getpwent()) != NULL) { while ((p = getspent()) != NULL) { PyObject *v = mkpwent(p); PyObject *v = mkspent(p); if (v==NULL || PyList_Append(d,v)!=0) { if (v==NULL || PyList_Append(d,v)!=0) { Py_XDECREF(v); Py_XDECREF(v); Py_DECREF(d); Py_DECREF(d); endpwent(); return NULL; bug-fixed endspent(); return NULL; } } Py_DECREF(v); Py_DECREF(v); } } endpwent(); endspent(); return d; return d;} } Python project revision #73017 40
  • Procedure A was created revision #20157 with a resource leak Procedure B (a code clone of A) revision #38359 is introduced without resource leaks4 years the resource leak can be fixed if MeCC were applied The resource leak bug in revision #73017 procedure A is fixed 41
  • const char *GetVariable (VariableSpace space, const char *name) const char *PQparameterStatus (const PGconn *conn, const char *paramName){ { struct_variable *current; const pgParameterStatus *pstatus; if (!space) if (!conn || !paramName) return NULL; return NULL; for (current=space->next;current;current=current->next) for (pstatus=conn->pstatus; pstatus!=NULL; pstatus = pstatus->next) { { if (strcmp(current->name,name) == 0) if (strcmp(pstatus->name.paramName)== 0) { return pstatus->value; return current->value; } } return NULL; } } return NULL;} MeCC successfully identifies these procedures PyObject *spwd_getspall (PyObject *self, PyObject *pwd_getpwall (PyObject *self) PyObject *args) { { PyObject *d; PyObject *d; struct passwd *p; struct spwd *p; if ((d = PyList_New(0)) == NULL) if ((d = PyList_New(0)) == NULL) return NULL; return NULL; setpwent(); setspent(); while ((p = getpwent()) != NULL) { while ((p = getspent()) != NULL) { PyObject *v = mkpwent(p); PyObject *v = mkspent(p); if (v==NULL || PyList_Append(d,v)!=0) { if (v==NULL || PyList_Append(d,v)!=0) { Py_XDECREF(v); Py_XDECREF(v); Py_DECREF(d); Py_DECREF(d); endspent(); return NULL; return NULL; } } Py_DECREF(v); Py_DECREF(v); } } endpwent(); endspent(); return d; return d; } } 42
  • Potential Bugs and Code Smells #Semantic Potential Code Clones Bugs (%) Smells (%)Python 95 26 (27.4%) 23 (24.2%)Apache 81 8 ( 9.9%) 27 (33.3%)PostgreSQL 102 21 (20.6%) 20 (19.6%)Total 278 55 (19.8%) 70 (25.2%) detected by MeCC 43
  • Study Limitation• Projects are open source and may not be representative• All clones are manually inspected• Default options are used for other tools (CCfinder, Deckard, PDG-based) 44
  • Conclusion• MeCC: Memory Comparison-based Clone Detector • a new clone detector using semantics- based static analysis • tolerant to syntactic variations • can be used to find potential bugs 45
  • Thank You! http://ropas.snu.ac.kr/mecc/ 46
  • Backup Slides 47
  • Time Spent Projects KLOC FP Total Time Python 435 39 264 1h Apache 343 24 191 5hPostgreSQL 937 47 278 7h Ubuntu 64-bit machine with a 2.4 GHz Intel Core 2 Quad CPU and 8 GB RAM.• False positive ratio is less than 15%• Slower than other tools (deep semantic analysis) 48
  • Structure Initialization 49
  • Structure Initialization 50
  • Judgement of Clones• Two parameters • In our experiment, similarity threshold 0.8 is used • Penalty function for small size of code clones log MinEntry S(M1 , M2 ) log(| M1 | + | M2 |) 2(2 · 1.0 + 2 · 1.0 + 1 · 0.5)/(6 + 5) = 0.82 51 a {(true, α)}
  • Static Analyzer• Flow-sensitive• Context-sensitive by procedural summaries• Path-sensitive• Abstract interpretation http://spa-arrow.com 52