Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary
Paper: Expanding Identifiers to Normalize Source Code Vocabulary

Authors: Dave Binkley and Dawn Lawrie

Session: Research Track 4: Natural Language Analysis

Published in: Technology, Education

Transcript

  • 1. EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY. PRESENTED BY DAWN LAWRIE, LOYOLA UNIVERSITY MARYLAND, IN COLLABORATION WITH DAVE BINKLEY. Friday, October 7, 11
  • 2. VOCABULARY MISMATCH: DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER SOFTWARE ARTIFACTS. EXAMPLE: REQUIREMENT - “FEATURE LOCATION”; SOURCE CODE - “FEATURELOCATION” OR WORSE “FLOC”
  • 3. PURPOSE OF NORMALIZE: COPE WITH VOCABULARY MISMATCH BETWEEN SOURCE CODE AND OTHER SOFTWARE DOCUMENTS
  • 4. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURELOCATION; FLOC
  • 5. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM); FLOC
  • 6. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM); F LOC (SPLITTING PROBLEM)
  • 7. EXAMPLE PROBLEMS: CONSIDER IDENTIFIERS FEATURE LOCATION (SPLITTING PROBLEM); FEATURE LOCATION (SPLITTING AND EXPANSION PROBLEM)
  • 8. WHY NORMALIZE? MANY SE PROBLEMS CAN BE ADDRESSED USING INFORMATION RETRIEVAL (IR) TECHNIQUES. UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE OF THE IMPORTANCE OF CRUCIAL WORDS
  • 9. NORMALIZE PROBLEM STATEMENT: FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS. FLOC: FEATURE LOCATION
  • 10. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN; SOFT-WORD - WHITE-HOUSE_LAWN
  • 11. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN (2); SOFT-WORD - WHITE-HOUSE_LAWN
  • 12. NORMALIZE ALGORITHM TERMINOLOGY: HARD-WORD - WHITEHOUSE_LAWN (2); SOFT-WORD - WHITE-HOUSE_LAWN (3)
  • 13. NORMALIZE ALGORITHM
  • 14. NORMALIZE ALGORITHM: STRLEN; STRING LENGTH
  • 15. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA
  • 16. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE WORDS: FATHER, VISITS, THE, POTATO, VISITOR, THE, CHURCH, POPE, HIT
  • 17. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE WORDS: FATHER, VISITS, THE, POTATO, VISITOR, THE, CHURCH, POPE, HIT
  • 18. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE WORDS: FATHER, VISITS, THE, POTATO, VISITOR, THE, CHURCH, POPE, HIT. STRONG COHESION
  • 19. MACHINE TRANSLATION APPROACH: EL PAPA VISITA LA IGLESIA. CANDIDATE WORDS: FATHER, VISITS, THE, POTATO, VISITOR, THE, CHURCH, POPE, HIT. STRONG COHESION
  • 20. NORMALIZE ALGORITHM
  • 21. NORMALIZE ALGORITHM: STRLEN
  • 22. NORMALIZE ALGORITHM: STRLEN. POSSIBLE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
  • 23. NORMALIZE ALGORITHM: STRLEN. E(RLEN) = {RIFLEMEN}. POSSIBLE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
  • 24. NORMALIZE ALGORITHM: STRLEN. E(RLEN) = {RIFLEMEN}; WILDCARD EXPANSION R*L*E*N*. POSSIBLE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
  • 25. NORMALIZE ALGORITHM: STRLEN. E(ST) = {SET, STOP, STRING}; E(RLEN) = {RIFLEMEN}; E(STR) = {STEER, STRING}; E(LEN) = {LENDER, LENGTH}. POSSIBLE SPLITS: S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N S-T-R-L-E-N
  • 26. NORMALIZE ALGORITHM PART I: STR: STRING VS STEER
  • 27. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER, LENGTH) VS STEER (LENDER, LENGTH)
  • 28. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER, LENGTH) VS STEER (LENDER, LENGTH). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
  • 29. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER + LENGTH = COHESION A) VS STEER (LENDER + LENGTH = COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
  • 30. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER + LENGTH = COHESION A) VS STEER (LENDER + LENGTH = COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
  • 31. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER + LENGTH = COHESION A) VS STEER (LENDER + LENGTH = COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION
  • 32. NORMALIZE ALGORITHM PART I: STR: STRING (LENDER + LENGTH = COHESION A) VS STEER (LENDER + LENGTH = COHESION B). 1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS 2. SELECT EXPANSION THAT MAXIMIZES COHESION. SELECTED: STRING
  • 33. NORMALIZE ALGORITHM PART II: STR-LEN VS ST-RLEN
  • 34. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN)
  • 35. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS
  • 36. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
  • 37. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
  • 38. NORMALIZE ALGORITHM PART II: STR-LEN (STRING LENGTH) VS ST-RLEN (STOP RIFLEMEN). 1. FIND COHESION OVER EXPANSIONS 2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION. SELECTED: STRING LENGTH
  • 39. ADDING CONTEXT
  • 40. ADDING CONTEXT: DIR
  • 41. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}
  • 42. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}. CONTEXT = {FORWARD, BACKWARD}
  • 43. ADDING CONTEXT: DIR. E(DIR) = {DIRECTION, DIRECTORY}. CONTEXT = {FORWARD, BACKWARD}. FIND COHESION WITH CONTEXT WORDS IN ADDITION TO EXPANSIONS OF OTHER SOFT WORDS. USED IN BOTH PART 1 AND PART 2
  • 44. NORMALIZE IMPLEMENTATION: USES GenTest TO SPLIT IDENTIFIERS (RETURNS MULTIPLE SPLITS); GOOGLE 5-GRAM DATASET
  • 45. EVALUATION
      Program      Loc      SLoc     Unique Ids
      which-2.20   3,670    2,293    487
      a2ps-4.14    62,347   38,436   4,393

      Program      Selected Ids   Hard Words   Soft Words
      which-2.20   487            903          1,214
      a2ps-4.14    211            459          618
  • 46. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
  • 47. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
  • 48. EVALUATION: THREE GROUPS OF IDENTIFIERS: STANDARD LIBRARY CALLS; NAMES FROM STANDARD HEADER FILES / KEYWORDS; DOMAIN NAMES
      Program      Filtered Ids   Reported Ids
      which-2.20   152            335
      a2ps-4.14    46             166
  • 49. EXAMPLE EXPANSIONS
      id         Top 10 Expansion    Top Expansion
      nextchar   next_character      next_character
      indfound   index_found_need    index_found
      optarg     option_are_g        optarg
      itemno     i_them_not          itemno
  • 50. RESEARCH QUESTIONS: WHAT IS THE OVERALL ACCURACY OF NORMALIZE? DOES THE VOCABULARY USED HAVE A SIGNIFICANT IMPACT ON THE EXPANSION’S ACCURACY? CAN THE EXPANDER INFORM THE SPLITTER? CAN THE SPLITTER INFORM THE EXPANDER?
  • 51. ACCURACY ON DOMAIN IDS
  • 52. SOURCE OF EXPANSION WORDS: SOURCE CODE; INTERNAL DOCUMENTATION; MANUAL
  • 53. BEST VOCABULARY SOURCE?
  • 54. FUTURE WORK: EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE DATA; EXPLORING DIFFERENT WAYS OF CALCULATING PROBABILITIES; EXAMINING NORMALIZATION IN CONTEXT OF AN INFORMATION RETRIEVAL TASK
  • 55. SUMMARY: IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER SOFTWARE DOCUMENTS; THIS DEGRADES PERFORMANCE OF IR TECHNIQUES; NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF SOFT WORDS CORRECTLY
  • 56. QUESTIONS? Need an identifier split? GenTest Splitter available at splitit.cs.loyola.edu
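
Slides 22 to 25 enumerate every possible split of STRLEN and then list candidate expansions E(...) for each soft word using a wildcard pattern such as R*L*E*N*. The following is a minimal Python sketch of those two steps, not the authors' implementation. The function names, the tiny VOCAB set, and the reading of the wildcard as "each letter followed by any run of characters" are all assumptions made for illustration.

```python
import re
from itertools import combinations

# Hypothetical toy vocabulary; the real tool draws words from other sources.
VOCAB = {"riflemen", "string", "steer", "length", "lender", "set", "stop"}

def all_splits(identifier):
    """Return every way to cut `identifier` into contiguous soft words.

    A word of n characters has n - 1 possible cut points, so there are
    2 ** (n - 1) splits, including the unsplit word itself.
    """
    n = len(identifier)
    splits = []
    for k in range(n):  # k = number of cut points used
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            splits.append([identifier[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits

def expansions(soft_word, vocabulary):
    """Return vocabulary words matching the wildcard form of `soft_word`.

    Assumed reading of the slide's R*L*E*N*: the word starts with the first
    letter and contains the remaining letters in order.
    """
    pattern = re.compile("".join(re.escape(c) + ".*" for c in soft_word))
    return sorted(w for w in vocabulary if pattern.fullmatch(w))

if __name__ == "__main__":
    print(len(all_splits("strlen")))   # number of candidate splits
    print(expansions("rlen", VOCAB))   # candidates for the soft word "rlen"
    print(expansions("str", VOCAB))    # candidates for the soft word "str"
```

With this literal wildcard reading, "st" would also match "steer", which the slide's E(ST) does not list, so the vocabulary or the matching rule behind the slides presumably differs from this sketch.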

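Slides 28 to 43 describe scoring a candidate expansion by its cohesion (the sum of the logs of word-pair probabilities), keeping the expansion and split that maximize it, and optionally adding context words to the score. The sketch below illustrates that idea under stated assumptions. The PAIR_PROB values are invented, the FLOOR for unseen pairs is an assumption, and the names `cohesion` and `best_expansion` are hypothetical; the slides say the implementation uses the Google 5-gram dataset but do not show how probabilities are computed.

```python
import math
from itertools import product

# Invented co-occurrence probabilities, for illustration only.
PAIR_PROB = {
    frozenset(("string", "length")): 1e-4,
    frozenset(("steer", "length")): 1e-7,
    frozenset(("stop", "riflemen")): 1e-9,
    frozenset(("direction", "forward")): 1e-5,
    frozenset(("direction", "backward")): 1e-5,
    frozenset(("directory", "forward")): 1e-8,
}
FLOOR = 1e-12  # assumed probability for pairs never seen together

def pair_log_prob(a, b):
    """Log probability that words a and b co-occur (order ignored)."""
    return math.log(PAIR_PROB.get(frozenset((a, b)), FLOOR))

def cohesion(words, context=()):
    """Sum log pair probabilities among the words and with context words."""
    score = 0.0
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            score += pair_log_prob(a, b)
        for c in context:
            score += pair_log_prob(a, c)
    return score

def best_expansion(candidate_splits, context=()):
    """Pick the most cohesive expansion over all splits.

    `candidate_splits` holds one entry per split; each entry is a list of
    candidate-expansion lists, one list per soft word in that split.
    """
    best_score, best_words = None, None
    for split in candidate_splits:
        for words in product(*split):
            score = cohesion(words, context)
            if best_score is None or score > best_score:
                best_score, best_words = score, words
    return best_words

if __name__ == "__main__":
    strlen_candidates = [
        [["steer", "string"], ["lender", "length"]],   # split STR-LEN
        [["set", "stop", "string"], ["riflemen"]],     # split ST-RLEN
    ]
    print(best_expansion(strlen_candidates))
    print(best_expansion([[["direction", "directory"]]],
                         context=("forward", "backward")))
```

Because each extra word adds more negative log terms, this raw sum favors splits with fewer soft words; the slides do not say whether or how the actual algorithm corrects for that, so the sketch leaves it uncorrected.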