Output fica.beamer.43

Filtering Clones for Individual User Based on Machine Learning Analysis

Jiachen Yang, Keisuke Hotta, Yoshiki Higo,
Hiroshi Igaki, Shinji Kusumoto
Graduate School of Information Science and Technology, Osaka University

June 4, 2012

Jiachen Yang (IST, Osaka-U) Fica@IWSC2012 June 4, 2012 1 / 14

Motivating Example

Figure: survey responses for each clone set, by participant of survey (1 to 8). Red: un-interesting. Blue: interesting.

Interesting U:0 vs I:8

(a) lib/readline/histexpand.c, lines 1542-1554

    static char *
    history_substring (string, start, end)
         const char *string;
         int start, end;
    {
      register int len;
      register char *result;
      len = end - start;
      result = (char *)xmalloc (len + 1);
      strncpy (result, string + start, len);
      result[len] = '\0';
      return result;
    }

(b) stringlib.c, lines 126-138

    char *
    substring (string, start, end)
         const char *string;
         int start, end;
    {
      register int len;
      register char *result;
      len = end - start;
      result = (char *)xmalloc (len + 1);
      strncpy (result, string + start, len);
      result[len] = '\0';
      return (result);
    }

Figure: Example of source code in bash-4.2

Un-Interesting U:8 vs I:0

(a) expr.c, lines 191-197

    ... __P((char *, arrayind_t, char *));
    static intmax_t subexpr __P((char *));
    static intmax_t expcomma __P((void));
    static intmax_t expassign __P((void));
    static intmax_t expcond __P((void));
    static intmax_t explor __P((void));
    static intmax_t expland __P((void));

(b) shell.c, lines 309-315

    static int run_one_command __P((char *));
    static int run_wordexp __P((char *));
    static int uidget __P((void));
    static void init_interactive __P((void));
    static void init_noninteractive __P((void));
    static void init_interactive_script __P((void));
    static void set_shell_name __P((char *));

Disagreed U:4 vs I:4

(a) execute_cmd.c, lines 710-725

    static int
    displen (s)
         const char *s;
    {
      wchar_t *wcstr;
      size_t wclen, slen;
      wcstr = 0;
      slen = mbstowcs (wcstr, s, 0);
      if (slen == -1)
        slen = 0;
      wcstr = (wchar_t *)xmalloc (sizeof ....
      mbstowcs (wcstr, s, slen + 1);
      wclen = wcswidth (wcstr, slen);
      free (wcstr);
      return ((int)wclen);
    }

(b) subst.c, lines 1098-1111

    else
      {
        if (wcharlist == 0)
          {
            size_t len;
            len = mbstowcs (wcharlist, charlist, 0);
            if (len == -1)
              len = 0;
            wcharlist = (wchar_t *)xmalloc (sizeof ....
            mbstowcs (wcharlist, charlist, len + 1);
          }
        if (wcschr (wcharlist, wc))
          break;
      }

Fica — the name

Filter for
Individual user on code
Clone
Analysis

Fica — the website

Figure: Snapshot of Fica


Compare Code Clone Similarity

Pi = possibility of being interesting
Pu = possibility of being un-interesting

Len   Pi      Pi/Pu   Pu      Comp
50    5.56%   1.18    4.72%   O
87    2.89%   1.11    2.59%   O
79    1.97%   0.69    2.87%   X
63    3.55%   0.64    5.57%   O
77    2.66%   0.46    5.83%   X
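
The ratio column can be recomputed from the Pi and Pu columns. The sketch below does that and applies a hypothetical decision rule, not stated on the slide, that a ratio above 1 leans towards "interesting". The names `rows` and `predict` are illustrative only.

```python
# Rows copied from the slide: (length, Pi in %, Pu in %).
rows = [
    (50, 5.56, 4.72),
    (87, 2.89, 2.59),
    (79, 1.97, 2.87),
    (63, 3.55, 5.57),
    (77, 2.66, 5.83),
]


def predict(pi, pu):
    # Assumed rule (not from the slide): Pi/Pu above 1 means "interesting".
    return "interesting" if pi / pu > 1 else "un-interesting"


predictions = [(length, pi / pu, predict(pi, pu)) for length, pi, pu in rows]
for length, ratio, label in predictions:
    print(f"{length:>3}  {ratio:.2f}  {label}")
```

Ratios recomputed from the rounded percentages can differ from the slide in the last digit, which suggests the slide's ratios were computed before rounding.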


Good Experiment Result
All training:   44    Matched:  32        un-interesting: 1
All evaluation: 34    Accuracy: 94.12%    interesting:    1


Bad Experiment Result
All training:   47    Matched:  14        un-interesting: 16
All evaluation: 31    Accuracy: 45.16%    interesting:    1
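
The accuracy figures on both result slides are consistent with dividing the matched count by the size of the evaluation set. A minimal check, where the function name `accuracy` is illustrative and not taken from Fica:

```python
def accuracy(matched, evaluated):
    """Share of evaluation clones whose prediction matched the user's mark, in percent."""
    return 100.0 * matched / evaluated


good = accuracy(32, 34)  # "Good Experiment Result" slide
bad = accuracy(14, 31)   # "Bad Experiment Result" slide
print(f"{good:.2f}% {bad:.2f}%")  # prints "94.12% 45.16%"
```

On each slide the evaluation clones that were not matched (34 - 32 = 2 and 31 - 14 = 17) equal the sum of the two remaining counts (1 + 1 and 16 + 1).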


Open Question

How to improve accuracy?
By combining metrics such as McCabe cyclomatic complexity?
Thank you!


Unmatched: User un-interesting


Unmatched: User interesting


Overall Workflow
1. Submits source code
2. Detects clones
3. Marks clones as “interesting” or not
4. Records marked clones into database
5. Studies characteristics of marks using machine learning algorithms
6. Ranks unmarked clones based on machine learning

Figure: Overall Workflow of Fica with CDT

Predicting Category

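\[
\mathrm{sim}(a, b, D) = \overrightarrow{\mathrm{tfidf}(a, D)} \cdot \overrightarrow{\mathrm{tfidf}(b, D)} \tag{5}
\]

\[
\mathrm{nsim}(a, b, D) =
\begin{cases}
0, & \mathrm{sim}(a, b, D) = 0 \\
\dfrac{\mathrm{sim}(a, b, D)}{|\mathrm{sim}(a, b, D)|}, & \text{otherwise}
\end{cases} \tag{6}
\]

\[
\mathrm{poss}(t, M) =
\begin{cases}
1, & |M| = 0 \\
\dfrac{\sum_{\forall m \in M} \mathrm{nsim}(t, m, M)}{|M|}, & \text{otherwise}
\end{cases} \tag{7}
\]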
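
Equations (5) to (7) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Fica's implementation: clones are represented as plain token lists, the idf variant `log(1 + N/df)` is assumed because the slide does not specify one, and the sample token lists are hypothetical. Equation (6) is implemented exactly as written on the slide, so with the non-negative weights used here `nsim` is either 0 or 1.

```python
import math
from collections import Counter


def tfidf(doc, corpus):
    """Sparse tf-idf vector (term -> weight) of token list `doc` against `corpus`."""
    n_docs = len(corpus)
    vector = {}
    for term, tf in Counter(doc).items():
        df = sum(1 for other in corpus if term in other)
        if df == 0:
            continue  # term absent from the corpus: contributes nothing against corpus documents
        # Assumed smoothed idf; the slide does not say which variant is used.
        vector[term] = tf * math.log(1 + n_docs / df)
    return vector


def sim(a, b, corpus):
    """Equation (5): dot product of the two tf-idf vectors."""
    va, vb = tfidf(a, corpus), tfidf(b, corpus)
    return sum(weight * vb[term] for term, weight in va.items() if term in vb)


def nsim(a, b, corpus):
    """Equation (6), taken literally from the slide."""
    s = sim(a, b, corpus)
    return 0.0 if s == 0 else s / abs(s)


def poss(t, marked):
    """Equation (7): mean nsim of `t` against every clone in `marked`; 1 if `marked` is empty."""
    if not marked:
        return 1.0
    return sum(nsim(t, m, marked) for m in marked) / len(marked)


# Hypothetical token lists standing in for clones a user has already marked.
interesting = [["len", "=", "end", "-", "start"], ["return", "result"]]
uninteresting = [["static", "int", "uidget"]]
target = ["len", "=", "0"]

p_i = poss(target, interesting)    # shares tokens with one of the two marked clones
p_u = poss(target, uninteresting)  # shares no token with the marked clone
print(p_i, p_u)  # prints "0.5 0.0"
```

Here `p_i` and `p_u` play the role of Pi and Pu: the unmarked clone is scored once against the clones marked interesting and once against those marked un-interesting.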


Result — bash
Figure: accuracy (%), 0 to 100, against percentage of training set (%), 10 to 100, for series A to H.

Result — git
Figure: accuracy (%), 0 to 100, for series A to H; horizontal axis from 10 to 100.

Result — xz
Figure: accuracy (%), 0 to 100, for series A to H; horizontal axis from 10 to 100.

Result — e2fsprogs
Figure: accuracy (%), 0 to 100, for series A to H; horizontal axis from 10 to 100.

Result — All Projects
Figure: accuracy (%), 0 to 100, for series A to H; horizontal axis from 10 to 100.
