Source Code Clone Search (Iman Keivanloo PhD seminar)

Source Code Clone Search and Detection (SeClone is a real-time, Internet-scale clone search and detection engine).

*The presentation contains animations; to see them, download the file and run it locally.

Published in: Technology

  • use of method names in queries resulted in a 98% "click rate" vs. 68% for queries without method names
  • http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Ftitle%0D%0AWHERE+{%0D%0A++++%3Fgame+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3AFirst-person_shooters%3E+.%0D%0A++++%3Fgame+foaf%3Aname+%3Ftitle+.%0D%0A}%0D%0Alimit+3&debug=on&timeout=&format=text%2Fhtml&save=display&fname=

    1. 1. Internet-scale Source Code Search and Analysis Framework • Iman Keivanloo • Advisor: Dr. Juergen Rilling • PhD Seminar, Computer Science and Software Engineering Department • November 17, 2011
    2. 2. Agenda • Research Context • Major questions & answers • Next step • Conclusion • Time Table
    3. 3. Research Context: Internet-Scale Code Search “is searching the Internet for source code to help solve a software development problem” [Gallardo, SUITE’09]
    4. 4. How to search for Source Code? • Free-form Query: – “how to write into file in Java” • Structural Query: – select col1 from table1 where col1="%write" [Keivanloo, SUITE’11] [Keivanloo, ICSM’10]
    5. 5. Research Focus: Similar Fragment Search
        • Input fragment:
            XMLReadFile inFile=new XMLReadFile("kb.xml");
            Window myWindow=new Window();
            myWindow.trigger(inFile);
            OutputStream result=new OutputStream();
            myWindow.flush(result);
        • Suggested simplified query: select lines which have (1) a method call statement on the trigger method.
        • Step 1: Input [the simplified structural query] to the Internet-Scale Structural Code Search Engine. Results:
            ... 10: Window myWindow=new Window(); 11: CSVReadFile csvData=new CSVReadFile("input.csv"); 12: myWindow.trigger(csvData); 13: OutputStream o=new OutputStream(); 14: myWindow.flush(o); 15: myWindow.close(); ...
            ... 59: Event e=new Event(50); 60: e.trigger(); 61: e.update(); ...
            ... 133: Listener res=new Listener(); 134: res.trigger("warm-up"); 135: res.close(); ...
          Note on line 12: this line looks like a match, however it uses .CSV instead of .XML. We can now use our clone search engine to find other code fragments similar to this one.
        • Step 2: Input [the selected fragment in the first step and its target line (red)] to the Real-time Clone Search Engine:
            ... 11: CSVReadFile csvData=new CSVReadFile("input.csv"); 12: myWindow.trigger(csvData); 13: OutputStream o=new OutputStream(); ...
          Results:
            ... 55: Window r=new Window(); 56: long timestamp=System.Now(); 57: System.out.println("Start reasoning..."); 58: XMLStream xmldata=new XMLStream(io); 59: r.trigger(xmldata); 60: OutputStream o=new OutputStream(); 61: r.flush(o); ...
            ... 89: Window var=new Window(); 90: XMLReadFile r=new XMLReadFile("k.xml"); 91: OutputStream o=new OutputStream(); 92: var.trigger(r); 93: var.flush(o); ...
          Note on lines 55-61: the pattern is similar but it uses XMLStream instead of XMLFile as the input.
          Note on lines 89-93: this match is acceptable, even if the order is different.
        • Other labels on the slide: "Gapped clone", "Unordered core from the 1:1 match", "The ideal expected answer".
    6. 6. Research Challenge
    7. 7. The Web Search Challenge
    8. 8. But Often Still Fail to Deliver the Expected Results After 10 Years of Research
    9. 9. No Ambiguity!
    10. 10. Early Conclusion: Source Code Search is similar to Web Search
    11. 11. Early Conclusion: Source Code Search is similar to Web Search • 1. Search techniques = ? • 2. Ambiguity resolution techniques = Code Analysis • Labels: Search; Analysis (Ambiguity resolution)
    12. 12. Research Approach Overview: Internet-scale Source Code Search and Analysis Framework • Search: Code Clone Search • Analysis: Semantic Web-based Code Analysis
    13. 13. Definitions & Requirements Search
    14. 14. Clone (Source Code Clone) • Similar code fragments, e.g. two copies of: for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!"); • Type 1: Identical except whitespace … • Type 2: Identical except variable names ... • Type 3: Identical except a few missing… • Type 4: Similar functionality [Roy, C. K., Cordy, J. R., & Koschke, R. (2009). Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming.]
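The Type-1 and Type-2 definitions above can be illustrated with a small normalization sketch. It is not SeClone's transformation rule: the tokenizer, the keyword set, and the function names `type1_key` and `type2_key` are assumptions made for illustration. Two lines are treated as Type-1 clones when their token sequences match, and as Type-2 clones when they still match after every non-keyword identifier is replaced by a placeholder.

```python
import re

# Hypothetical, deliberately crude tokenizer: string literals, identifiers,
# integers, then any other single non-whitespace character.
TOKEN = re.compile(r'"[^"]*"|[A-Za-z_]\w*|\d+|\S')
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
# Small illustrative subset of Java keywords (an assumption, not a full list).
KEYWORDS = {"for", "if", "else", "while", "new", "return"}


def tokens(line):
    """Split one line of code into tokens, discarding whitespace."""
    return TOKEN.findall(line)


def type1_key(line):
    """Key that ignores whitespace differences only (Type-1)."""
    return " ".join(tokens(line))


def type2_key(line):
    """Key that also ignores identifier names (Type-2).

    Every non-keyword identifier becomes "ID", including type and
    method names, which is coarser than "variable names" alone.
    """
    out = []
    for tok in tokens(line):
        if IDENTIFIER.fullmatch(tok) and tok not in KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return " ".join(out)


a = "Window myWindow=new Window();"
b = "Window  r = new Window();"
same_type1 = type1_key(a) == type1_key(b)  # names differ
same_type2 = type2_key(a) == type2_key(b)  # names abstracted away
```

Replacing method names as well is a simplification; a later slide in this deck argues that method names should be preserved to keep precision high.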
    15. 15. Clone Search • Query: for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hello!"); • Code Database: – for (Attribute attribute:es1.getAttributes()) System.out.println("Test"); – for (IAttribute att:source.getAttributes()) { System.out.println("Please do not read me"); – for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");
    16. 16. Clone Search • Answer • Query
    17. 17. Internet-scale Clone Search • Query: for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hello!");
    18. 18. Internet-scale Real-time Clone Search
    19. 19. Internet-scale Real-time Clone Search • Requirements?
    20. 20. Internet-scale Real-time Clone Search • Requirements: – Millions LOC (~300 MLOC)
    21. 21. Internet-scale Real-time Clone Search • Requirements: – Millions LOC – 100 Milliseconds
    22. 22. Internet-scale Real-time Clone Search • Example fragments: – for (IAttribute att:source.getAttributes()) { System.out.println("Please do not read me"); – for (JAttribute attribute:formType.getAttributes()) System.out.println("Test"); • Requirements: – Millions LOC – 100 Milliseconds – Precision – Recall – Type-1, 2, 3…
    23. 23. Internet-scale Real-time Clone Search • Requirements: – Millions LOC – 100 Milliseconds – Precision – Recall – Type-1, 2, 3…
    24. 24. Research Question #1 • Real-time answer (faster than 100 ms): is it actually possible?
    25. 25. Our Initial Analysis • SeClone: An Internet-scale Real-time Clone Search Engine • Search (Phase 1) • Analysis (Phase 2) [Keivanloo, ICPC’11]
    26. 26. Inside SeClone: Phase 1 • Syntactical pattern matching • Phase 1: Pattern Matching • Phase 2
    27. 27. Inside SeClone: Phase 2 • Information Retrieval & Clustering algorithm • Phase 1: Pattern Matching • Phase 2: Semantic Matching • Numbered fragments: – 1 for (Attribute attribute:exampleSet.getAttribute System.out.println("The end"); – 2 for (Attribute attribute:es1.getAttributes()) System.out.println("Test"); – 3 for (AttributeEntity theAttributeEntity:aTable System.out.println("Hello!"); – 4 for (JAttribute attribute:formType.getAttribute System.out.println("Test"); – 5 for (IAttribute att:source.getAttributes()) { System.out.println("Please do not read m
    28. 28. Research Question #2: The Dilemma • How to distribute the 100 milliseconds between phases? • Scale: 0 25 50 75 100 • Pattern Matching • Semantic Matching [Keivanloo, WCRE’11]
    29. 29. Our Further Analysis [WCRE’11] • Requirements / Constraints: – 100 Milliseconds – Millions LOC – Precision – Recall – Type-1, 2, 3… • The Dilemma: scale 0 25 50 75 100, Pattern Matching vs. Semantic Matching • SeClone [ICPC’11]: O(p * log n) • Data Characteristics
    30. 30. Source Code Characteristics
    31. 31. Analysis of the Data Characteristics: Dataset preparation • Name: IJaDataset – Comprehensive (inter-project), to avoid project-specific results – ~18,000 projects – 1,500,000 unique Java classes (no duplicate, empty, or buggy files) – ~300 MLOC • Online at http://aseg.cs.concordia.ca/seclone
    32. 32. Analysis of the Data Characteristics: Granularity Effect • Three Level Similarity (TLS): sets of similar three-line fragments • First Level Similarity (FLS): single-line patterns
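To make the two granularities concrete, the following is a minimal sketch of grouping code positions by a single-line pattern (FLS) versus by three consecutive line patterns (TLS). The function names, the toy corpus, and the assumption that lines are already normalized into patterns are all hypothetical; the slides do not describe the actual index layout.

```python
from collections import defaultdict


def fls_groups(files):
    """Group (file, line index) positions by one line pattern."""
    groups = defaultdict(list)
    for name, lines in files.items():
        for i, line in enumerate(lines):
            groups[line].append((name, i))
    return groups


def tls_groups(files):
    """Group positions by the patterns of three consecutive lines."""
    groups = defaultdict(list)
    for name, lines in files.items():
        for i in range(len(lines) - 2):
            groups[tuple(lines[i:i + 3])].append((name, i))
    return groups


# Made-up corpus; each string stands for an already normalized line.
files = {
    "A.java": ["w=new W()", "w.trigger(x)", "w.flush(o)", "w.close()"],
    "B.java": ["w=new W()", "w.trigger(x)", "w.flush(o)"],
    "C.java": ["w=new W()", "log(x)", "w.close()"],
}

fls = fls_groups(files)
tls = tls_groups(files)
largest_fls = max(len(members) for members in fls.values())
largest_tls = max(len(members) for members in tls.values())
```

In this toy corpus the most frequent single line occurs in all three files, while the most frequent three-line window occurs in only two, so a query keyed on three lines has fewer candidates to evaluate. The corpus is far too small to reproduce the ratios reported in the deck.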
    33. 33. Analysis of the Data Characteristics: Clone frequency • How many code fragments are analyzed by each query? • Answer: 3 (average)
    34. 34. Analysis of the Data Characteristics: Clone frequency • Observation result: – TLS distributes the candidates into 3.9 times more groups – Its group size is 6 times smaller than that of FLS
    35. 35. Analysis of the Data Characteristics: Clone frequency • Conclusion: – The TLS heuristic is practical for real-time clone search, as long as the outliers are handled properly – Why? (1) each TLS group has 2.37 members on average; (2) it distributes candidates into small groups; (3) for each query, only one group must be evaluated
    36. 36. What Does an Outlier Look Like? • Outlier definition: patterns with more than 2,000 occurrences • Observation result: – Only ~1,000 patterns out of 30M – ~0.01% of patterns – Mostly insignificant code patterns
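The outlier definition above amounts to a frequency cut-off. Here is a minimal sketch, assuming pattern frequencies are already available as a mapping from pattern to occurrence count; the function name `split_outliers` and the sample counts are invented for illustration and are not taken from IJaDataset.

```python
# Threshold taken from the slide: more than 2,000 occurrences.
OUTLIER_THRESHOLD = 2000


def split_outliers(pattern_counts, threshold=OUTLIER_THRESHOLD):
    """Separate regular patterns from outliers by occurrence count."""
    regular = {p: c for p, c in pattern_counts.items() if c <= threshold}
    outliers = {p: c for p, c in pattern_counts.items() if c > threshold}
    return regular, outliers


# Made-up counts, chosen only to exercise both sides of the threshold.
counts = {
    "}": 50000,
    "return null;": 2001,
    "w.trigger(x);": 2000,
    "foo.bar();": 3,
}
regular, outliers = split_outliers(counts)
```

How the outliers are then handled (skipped, capped, or indexed separately) is not stated on the slide, so the sketch stops at the split.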
    37. 37. Analysis of the Data Characteristics: Sampling efficiency • Can sampling be used to reduce the amount of data being analyzed? • Answer: Yes (e.g., 33% contains 91% of popular patterns)
    38. 38. Analysis of the Data Characteristics: Indexing • Can 32-bit hash keys (versus MD5) be used without affecting index quality? • Illustration: abc → 123, aXc → 456 (distinct keys) versus abc → 123, aXc → 123 (same key) • Answer: Yes – 0.002% error rate – Only 10 cases of the same key for three distinct strings
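The indexing question can be explored with a short experiment that measures how many distinct patterns share a key under a 32-bit hash compared with MD5. This is a sketch under assumptions: the slides do not name the 32-bit hash function, so `zlib.crc32` is used here purely as a stand-in, and `collision_rate` is a hypothetical helper.

```python
import hashlib
import zlib
from collections import defaultdict


def key32(pattern):
    """32-bit key; CRC-32 is a stand-in, not the deck's hash function."""
    return zlib.crc32(pattern.encode("utf-8"))


def key_md5(pattern):
    """128-bit MD5 key as a hexadecimal string, used as the baseline."""
    return hashlib.md5(pattern.encode("utf-8")).hexdigest()


def collision_rate(patterns, key_fn):
    """Fraction of distinct patterns that share a key with another one."""
    buckets = defaultdict(set)
    for p in set(patterns):
        buckets[key_fn(p)].add(p)
    distinct = sum(len(b) for b in buckets.values())
    collided = sum(len(b) for b in buckets.values() if len(b) > 1)
    return collided / distinct if distinct else 0.0


# The two strings from the slide's illustration.
patterns = ["abc", "aXc"]
rate32 = collision_rate(patterns, key32)
rate_md5 = collision_rate(patterns, key_md5)
```

Running `collision_rate` over the real set of normalized patterns with both key functions would give the kind of comparison the slide summarizes; the two-string example only shows the mechanics.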
    39. 39. Are Method Names Reliable? • Input data: Koders 1-year query log – ~10M records • Observation purpose: – Importance of method names • Observation result: – 98% success rate vs. 69% • Result interpretation: – Method names in this context are a reliable source of information – They must be preserved to increase precision
    40. 40. Source Code Search Framework
    41. 41. Internet-scale Real-time Code Clone Search via Multi-level Indexing – Internet-scale & Speed • 32-bit Hash values – Type-3 clone • Multi-level indexing – Customized for Internet-scale Code Search • Special transformation rule
    42. 42. Response Time (Pattern Matching) [WCRE’11] • Regular queries – 25 microseconds • 99.99% of queries – 900 microseconds
    43. 43. Conclusion
    44. 44. Answer: Research Question #1 • Is Internet-scale Real-time Code Search Possible? YES
    45. 45. Answer: Research Question #2 (The Dilemma) • How to distribute the 100 milliseconds between phases? • Answer (scale 0 25 50 75 100): – Pattern Matching: 1 millisecond – Semantic Matching: 99 milliseconds
    46. 46. Research Opportunity (scale 0 25 50 75 100) • Pattern Matching • Semantic Matching: 99 milliseconds • Analysis
    47. 47. Summary • Step 1: Studied characteristics of source code on the Internet – Unique pattern distribution (sampling application) – Pattern frequencies (multi-level search) – 32-bit hashing strength (code pattern) – Outlier patterns – Method name importance • Step 2: Designed an Internet-scale clone search – Customized for code search (precision) – Fine granularity – Multi-level indexing approach (Type-3 clone) – Microsecond-range response time (up to 10 times faster)
    48. 48. Publication • Code Clone Search and Detection (http://aseg.cs.concordia.ca/seclone/) – Iman Keivanloo, Juergen Rilling, Philippe Charland. Internet-scale Real-time Code Clone Search via Multi-level Indexing. 18th Working Conference on Reverse Engineering (WCRE 2011), Lero, Limerick, Ireland. – Iman Keivanloo, Juergen Rilling, Philippe Charland. SeClone – A Hybrid Approach to Internet-Scale Real-Time Code Clone Search. 19th IEEE International Conference on Program Comprehension (ICPC 2011), Kingston, Ontario, Canada. • Source Code Sharing using Linked Data (secold.org) – Iman Keivanloo, Chris Forbes, Juergen Rilling, Philippe Charland. Towards Sharing Source Code Facts Using Linked Data. ICSE Workshop on Search-Driven Development: Users, Infrastructure, Tools and Evaluation (SUITE), 2011. • Source Code Search (http://aseg.cs.concordia.ca/codesearch) – Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. Semantic Web-based Source Code Search. 6th International Workshop on Semantic Web Enabled Software Engineering (SWESE 2010), June 35, San Francisco, USA. – Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure. 26th IEEE International Conference on Software Maintenance (ICSM), Early Research Achievements (ERA) Track, Sept. 12-18, Timișoara, Romania.
    49. 49. Thank you for your kind attention • Questions?
