
Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary


Paper: Expanding Identifiers to Normalize Source Code Vocabulary

Authors: Dave Binkley and Dawn Lawrie

Session: Research Track 4: Natural Language Analysis

Transcript of "Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary"

  1. EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY. PRESENTED BY DAWN LAWRIE, LOYOLA UNIVERSITY MARYLAND, IN COLLABORATION WITH DAVE BINKLEY. Friday, October 7, 11
  2. VOCABULARY MISMATCH: DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER SOFTWARE ARTIFACTS. EXAMPLE: REQUIREMENT - “FEATURE LOCATION”; SOURCE CODE - “FEATURELOCATION” OR WORSE “FLOC”
  3. PURPOSE OF NORMALIZE: COPE WITH VOCABULARY MISMATCH BETWEEN SOURCE CODE AND OTHER SOFTWARE DOCUMENTS
  4. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURELOCATION AND FLOC
  5. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS. FEATURELOCATION -> FEATURE LOCATION (SPLITTING PROBLEM); FLOC
  6. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS. FEATURELOCATION -> FEATURE LOCATION (SPLITTING PROBLEM); FLOC -> F LOC (SPLITTING PROBLEM)
  7. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS. FEATURELOCATION -> FEATURE LOCATION (SPLITTING PROBLEM); FLOC -> FEATURE LOCATION (SPLITTING AND EXPANSION PROBLEM)
  8. WHY NORMALIZE? MANY SE PROBLEMS CAN BE ADDRESSED USING INFORMATION RETRIEVAL (IR) TECHNIQUES. UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE OF THE IMPORTANCE OF CRUCIAL WORDS
  9. NORMALIZE PROBLEM STATEMENT: FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS. FLOC -> FEATURE LOCATION
  10. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN; SOFT-WORD - WHITE-HOUSE_LAWN
  11. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN (2); SOFT-WORD - WHITE-HOUSE_LAWN
  12. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN (2); SOFT-WORD - WHITE-HOUSE_LAWN (3)
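
The hard-word count on the terminology slides can be reproduced with a small sketch. It assumes hard words are delimited by underscores and by lower-to-upper camelCase boundaries; the function name `hard_words` is hypothetical and is not taken from the talk.

```python
import re


def hard_words(identifier: str) -> list[str]:
    """Split an identifier at explicit markers: underscores and camelCase."""
    parts: list[str] = []
    for chunk in identifier.split("_"):
        # Put a space wherever a lowercase letter or digit is followed by an
        # uppercase letter, then split on whitespace (empty chunks vanish).
        parts.extend(re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", chunk).split())
    return parts


print(hard_words("WHITEHOUSE_LAWN"))   # ['WHITEHOUSE', 'LAWN'], 2 hard words
print(hard_words("featureLocation"))   # ['feature', 'Location']
```

Splitting WHITEHOUSE further into the soft words WHITE and HOUSE has no explicit marker to rely on, which is why the later slides enumerate candidate splits.
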
  13. NORMALIZE ALGORITHM
  14. NORMALIZE ALGORITHM: STRLEN -> STRING LENGTH
  15. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA
  16. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE TRANSLATIONS: FATHER VISITS THE POTATO VISITOR THE CHURCH POPE HIT
  17. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE TRANSLATIONS: FATHER VISITS THE POTATO VISITOR THE CHURCH POPE HIT
  18. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE TRANSLATIONS: FATHER VISITS THE POTATO VISITOR THE CHURCH POPE HIT. STRONG COHESION
  19. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE TRANSLATIONS: FATHER VISITS THE POTATO VISITOR THE CHURCH POPE HIT. STRONG COHESION
  20. NORMALIZE ALGORITHM
  21. NORMALIZE ALGORITHM: STRLEN
  22. NORMALIZE ALGORITHM: STRLEN. CANDIDATE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
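
The list of candidate splits for STRLEN can be generated mechanically. The sketch below is only an illustration of exhaustive enumeration; the talk states that the actual tool relies on GenTest to propose splits. A word of n letters has n-1 gaps, each either cut or not, so there are 2^(n-1) candidates including the unsplit word. The name `all_splits` is hypothetical.

```python
from itertools import combinations
from typing import Iterator


def all_splits(word: str) -> Iterator[list[str]]:
    """Yield every way to cut `word` into contiguous, non-empty pieces."""
    n = len(word)
    for k in range(n):  # k = number of cut points, from 0 to n-1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]


splits = list(all_splits("strlen"))
print(len(splits))                   # 32
print(["str", "len"] in splits)      # True
```
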
  23. NORMALIZE ALGORITHM: STRLEN. E(RLEN) = {RIFLEMEN} (CANDIDATE SPLITS AS ON SLIDE 22)
  24. NORMALIZE ALGORITHM: STRLEN. E(RLEN) = {RIFLEMEN}, WILDCARD EXPANSION R*L*E*N* (CANDIDATE SPLITS AS ON SLIDE 22)
  25. NORMALIZE ALGORITHM: STRLEN. E(ST) = {SET, STOP, STRING}; E(RLEN) = {RIFLEMEN}; E(STR) = {STEER, STRING}; E(LEN) = {LENDER, LENGTH} (CANDIDATE SPLITS AS ON SLIDE 22)
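
The wildcard expansion R*L*E*N* can be read as "the letters of the soft word, in order, with anything in between and after". A minimal sketch under that reading follows; the word list `TOY_DICTIONARY` and the function name `expansions` are assumptions made for illustration, not the vocabulary or API used by the authors.

```python
import re

# Hypothetical word list, just large enough to mirror the slide examples.
TOY_DICTIONARY = ["set", "stop", "string", "steer", "lender", "length", "riflemen"]


def expansions(soft_word: str, dictionary: list[str]) -> list[str]:
    """Return dictionary words matching the wildcard pattern c1*c2*...cn*."""
    # "rlen" becomes the regular expression r.*l.*e.*n.*
    pattern = re.compile(".*".join(map(re.escape, soft_word.lower())) + ".*")
    return sorted(w for w in dictionary if pattern.fullmatch(w.lower()))


print(expansions("rlen", TOY_DICTIONARY))  # ['riflemen']
print(expansions("str", TOY_DICTIONARY))   # ['steer', 'string']
print(expansions("len", TOY_DICTIONARY))   # ['lender', 'length']
```

With this toy list, `expansions("st", ...)` also admits "steer", which the slide does not list under E(ST); the slide sets are illustrative subsets.
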
  26. NORMALIZE ALGORITHM PART I: STR: STRING VS STEER
  27. NORMALIZE ALGORITHM PART I: STR: STRING (WITH LENDER, LENGTH) VS STEER (WITH LENDER, LENGTH)
  28. NORMALIZE ALGORITHM PART I: STR: STRING (WITH LENDER, LENGTH) VS STEER (WITH LENDER, LENGTH). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
  29. NORMALIZE ALGORITHM PART I: STR: (STRING, LENDER) + (STRING, LENGTH) = COHESION A VS (STEER, LENDER) + (STEER, LENGTH) = COHESION B. 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
  30. NORMALIZE ALGORITHM PART I: STR: (STRING, LENDER) + (STRING, LENGTH) = COHESION A VS (STEER, LENDER) + (STEER, LENGTH) = COHESION B. 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
  31. NORMALIZE ALGORITHM PART I: STR: (STRING, LENDER) + (STRING, LENGTH) = COHESION A VS (STEER, LENDER) + (STEER, LENGTH) = COHESION B. 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
  32. NORMALIZE ALGORITHM PART I: STR: (STRING, LENDER) + (STRING, LENGTH) = COHESION A VS (STEER, LENDER) + (STEER, LENGTH) = COHESION B. SELECTED: STRING. 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
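
Part I can be sketched as follows, assuming (as the slides suggest) that each candidate expansion of a soft word is scored against every candidate expansion of the other soft words in the same split. The probabilities in `TOY_PAIR_PROB` are invented for illustration; the talk draws co-occurrence data from the Google 5-gram dataset. The floor value for unseen pairs and all function names are likewise assumptions.

```python
import math

# Invented co-occurrence probabilities; NOT real Google 5-gram data.
TOY_PAIR_PROB = {
    frozenset(("string", "length")): 1e-3,
    frozenset(("string", "lender")): 1e-7,
    frozenset(("steer", "length")): 1e-6,
    frozenset(("steer", "lender")): 1e-7,
}
FLOOR = 1e-9  # assumed probability for word pairs never seen together


def pair_prob(a: str, b: str) -> float:
    return TOY_PAIR_PROB.get(frozenset((a, b)), FLOOR)


def cohesion(candidate: str, other_sets: list[list[str]]) -> float:
    """Sum of log pair probabilities against all candidates of other soft words."""
    return sum(math.log(pair_prob(candidate, o)) for others in other_sets for o in others)


def expand_split(candidate_sets: list[list[str]]) -> list[str]:
    """For each soft word, pick the candidate expansion with maximal cohesion."""
    chosen = []
    for i, candidates in enumerate(candidate_sets):
        others = candidate_sets[:i] + candidate_sets[i + 1:]
        chosen.append(max(candidates, key=lambda c: cohesion(c, others)))
    return chosen


# Split STR-LEN with E(STR) = {steer, string} and E(LEN) = {lender, length}.
print(expand_split([["steer", "string"], ["lender", "length"]]))  # ['string', 'length']
```
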
  33. NORMALIZE ALGORITHM PART II: STR-LEN VS ST-RLEN
  34. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN)
  35. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS
  36. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
  37. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
  38. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). SELECTED: STRING LENGTH. 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
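
Part II then compares whole splits by the cohesion of the expansion each one produced. The sketch below assumes the per-split expansions are already chosen and uses invented probabilities; `best_split` and `TOY_PAIR_PROB` are hypothetical names. It only compares splits with the same number of soft words, because a raw sum of log probabilities shrinks as more pairs are added, and the slides do not say how splits of different sizes are normalized.

```python
import math
from itertools import combinations

# Invented co-occurrence probabilities; NOT real Google 5-gram data.
TOY_PAIR_PROB = {
    frozenset(("string", "length")): 1e-3,
    frozenset(("stop", "riflemen")): 1e-8,
}
FLOOR = 1e-9  # assumed probability for word pairs never seen together


def cohesion(words: tuple[str, ...]) -> float:
    """Sum the log co-occurrence probability over every pair of expanded words."""
    return sum(
        math.log(TOY_PAIR_PROB.get(frozenset(pair), FLOOR))
        for pair in combinations(words, 2)
    )


def best_split(expanded: dict[tuple[str, ...], tuple[str, ...]]):
    """Return the (split, expansion) entry whose expansion has maximal cohesion."""
    return max(expanded.items(), key=lambda item: cohesion(item[1]))


candidates = {
    ("str", "len"): ("string", "length"),
    ("st", "rlen"): ("stop", "riflemen"),
}
print(best_split(candidates))  # (('str', 'len'), ('string', 'length'))
```
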
  39. ADDING CONTEXT
  40. ADDING CONTEXT: DIR
  41. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}
  42. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}; CONTEXT = {FORWARD, BACKWARD}
  43. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}; CONTEXT = {FORWARD, BACKWARD}. FIND COHESION WITH CONTEXT WORDS IN ADDITION TO EXPANSIONS OF OTHER SOFT WORDS. USED IN BOTH PART 1 AND PART 2
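
For a single soft word such as DIR there are no sibling expansions to score against, so context words carry the decision. A minimal sketch, assuming context contributes log pair probabilities in the same way as other soft words do; the numbers are invented so that the context {FORWARD, BACKWARD} favors "direction", and the function names are hypothetical.

```python
import math

# Invented co-occurrence probabilities; NOT real Google 5-gram data.
TOY_PAIR_PROB = {
    frozenset(("direction", "forward")): 1e-4,
    frozenset(("direction", "backward")): 1e-4,
    frozenset(("directory", "forward")): 1e-7,
    frozenset(("directory", "backward")): 1e-7,
}
FLOOR = 1e-9  # assumed probability for word pairs never seen together


def cohesion_with_context(candidate: str, context: list[str]) -> float:
    """Sum log co-occurrence probabilities of a candidate with each context word."""
    return sum(
        math.log(TOY_PAIR_PROB.get(frozenset((candidate, c)), FLOOR)) for c in context
    )


def expand_with_context(candidates: list[str], context: list[str]) -> str:
    """Pick the candidate expansion that is most cohesive with the context."""
    return max(candidates, key=lambda c: cohesion_with_context(c, context))


print(expand_with_context(["direction", "directory"], ["forward", "backward"]))  # direction
```

In a full implementation these context terms would be added to the soft-word cohesion sums of Part I and Part II rather than used alone.
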
  44. NORMALIZE IMPLEMENTATION: USES GenTest TO SPLIT IDENTIFIERS (RETURNS MULTIPLE SPLITS); GOOGLE 5-GRAM DATASET
  45. EVALUATION
      Program      Loc      SLoc     Unique Ids
      which-2.20   3,670    2,293    487
      a2ps-4.14    62,347   38,436   4,393

      Program      Selected Ids   Hard Words   Soft Words
      which-2.20   487            903          1214
      a2ps-4.14    211            459          618
  46. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
  47. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
  48. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
      Program      Filtered Ids   Reported Ids
      which-2.20   152            335
      a2ps-4.14    46             166
  49. EXAMPLE EXPANSIONS
      id         Top 10 Expansion    Top Expansion
      nextchar   next_character      next_character
      indfound   index_found_need    index_found
      optarg     option_are_g        optarg
      itemno     i_them_not          itemno
  50. RESEARCH QUESTIONS: WHAT IS THE OVERALL ACCURACY OF NORMALIZE? DOES THE VOCABULARY USED HAVE A SIGNIFICANT IMPACT ON THE EXPANSION’S ACCURACY? CAN THE EXPANDER INFORM THE SPLITTER? CAN THE SPLITTER INFORM THE EXPANDER?
  51. ACCURACY ON DOMAIN IDS
  52. SOURCE OF EXPANSION WORDS: SOURCE CODE; INTERNAL DOCUMENTATION; MANUAL
  53. BEST VOCABULARY SOURCE?
  54. FUTURE WORK: EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE DATA; EXPLORING DIFFERENT WAYS OF CALCULATING PROBABILITIES; EXAMINING NORMALIZATION IN CONTEXT OF AN INFORMATION RETRIEVAL TASK
  55. SUMMARY: IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER SOFTWARE DOCUMENTS (DEGRADES PERFORMANCE OF IR TECHNIQUES); NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF SOFT WORDS CORRECTLY
  56. QUESTIONS? Need an identifier split? GenTest Splitter available at splitit.cs.loyola.edu