Relaxed Parsing of Regular Approximations of
String-Embedded Languages
Author: Ekaterina Verbitskaia
Saint Petersburg Stat...
String-embedded code
Embedded SQL
SqlCommand myCommand = new SqlCommand(
"SELECT * FROM table WHERE Column = @Param2",
myC...
Problems
String-embedded code are expressions in some programming language
It may be necessary to support them in IDE: cod...
Static analysis of string-embedded code
Performed without programm execution
Checks that the set of properties holds for e...
Existing tools
PHP String Analyzer, Java String Analyzer, Alvor
Static analyzers for PHP, Java, and SQL embedded into Java...
Static analysis of string-embedded code: the scheme
Identification of hotspots: points of interest, where the analysis is
...
Static analysis of string-embedded code: the scheme
Code: hotspot is marked string res = "";
for(i = 0; i < l; i++)
res = ...
Static analysis of string-embedded code: the scheme
Approximation
After lexing
Grammar
start ::= s
s ::= LBR s RBR s
s ::=...
Problem statement
The aim is to develop the algorithm suitable for syntactic analysis of
string-embedded code
Tasks:
Devel...
Algorithm
Input: reference DCF grammar G and DFA graph with no
𝜖-transitions over the alphabeth of terminals of G
Output: ...
Right-Nulled Generalized LR algorithm
RNGLR processes context free grammars
In case when LR conflicts occur, parses in eac...
RNGLR data structures: Graph-Structured Stack
0
2
a
3
b
9
B
4
A
6
a
8a
10
A
5
a
8
a
a
11a
1
S
Ekaterina Verbitskaia (SPbSU...
RNGLR operations: shift
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
Ekaterina Verbitskaia (SPbSU) 26/08/2015 13 / 27
RNGLR operations: shift
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
5
a
8
a
a
11
a
Ekaterina Verbitskaia (...
RNGLR operations: reduce
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
5
a
8
a
a
11
a
Ekaterina Verbitskaia (SPbSU) 26/08/2015 14 / 27
RNGLR operations: reduce
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
5
a
8
a
a
11
a
0
2
a
3
b
9
B
4
A
6
a
8a
10
A
5
a
8
a
a
11a
1
S
Ekat...
RNGLR
Read the input sequentially
Process all reductions
Shift the next token
Each time new vertex is added to the GSS, sh...
Algorithm
Traverse the automaton graph and sequentially construct GSS,
similarly as in RNGLR
New type of “conflict”: Shift...
Cycles processing: initial stack
Ekaterina Verbitskaia (SPbSU) 26/08/2015 17 / 27
Cycles processing: some edges were added
5
1
0
2
4
3
...
Ekaterina Verbitskaia (SPbSU) 26/08/2015 18 / 27
Cycles processing: cycle closed
5
1
0
2
4
3
...
Ekaterina Verbitskaia (SPbSU) 26/08/2015 19 / 27
Cycles processing: new reduction
5
1
0
2
4
3
...
Ekaterina Verbitskaia (SPbSU) 26/08/2015 20 / 27
Cycles processing: new reduction added
5
1
0
2
4
3
...
Ekaterina Verbitskaia (SPbSU) 26/08/2015 21 / 27
Parse forest construction
Shared Packed Parse Forest — graph, in which all derivation trees are
merged
Constructed in the ...
Algorithm: correctness
Correct tree — derivation tree of
some string accumulated along
the path in the input graph
n start...
Algorithm: correctness
Theorem (Termination)
Algorithm terminates for any input
Theorem (Correctness)
Every tree, generate...
Implementation
The algorithm is implemented as a part of YaccConstructor project
using F# programming language
The generat...
Evaluation
The data is taken from the project of migration from MS-SQL to
Oracle Server
2,7 lines of code, 2430 queries, 2...
Conclusion
The algorithm for parsing of regular approximation of dynamically
generated string which constructs the finite ...
Upcoming SlideShare
Loading in …5
×

Relaxed Parsing of Regular Approximations of String-Embedded Languages

370 views

Published on

We present a technique for syntax analysis of a regular set of input strings. This problem is relevant for analysis of string-embedded languages when a host program generates clauses of embedded language at run time. Our technique is based on a generalization of RNGLR algorithm, which, inherently, allows us to construct a finite representation of parse forest for regularly approximated set of input strings. This representation can be further utilized for semantic analysis and transformations in the context of reengineering, code maintenance, program understanding etc. The approach in question implements relaxed parsing: non-recognized strings in approximation set are ignored with no error detection.

Published in: Software
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
370
On SlideShare
0
From Embeds
0
Number of Embeds
7
Actions
Shares
0
Downloads
3
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Relaxed Parsing of Regular Approximations of String-Embedded Languages

  1. 1. Relaxed Parsing of Regular Approximations of String-Embedded Languages Author: Ekaterina Verbitskaia Saint Petersburg State University JetBrains Programming Languages and Tools Lab 26/08/2015 Ekaterina Verbitskaia (SPbSU) 26/08/2015 1 / 27
  2. 2. String-embedded code Embedded SQL SqlCommand myCommand = new SqlCommand( "SELECT * FROM table WHERE Column = @Param2", myConnection); myCommand.Parameters.Add(myParam2); Dynamic SQL IF @X = @Y SET @TBL = ’ #table1 ’ ELSE SET @TBL = ’ table2 ’ SET @S = ’SELECT x FROM’ + @TBL + ’WHERE ISNULL(n,0) > 1’ EXECUTE (@S) Ekaterina Verbitskaia (SPbSU) 26/08/2015 2 / 27
  3. 3. Problems String-embedded code are expressions in some programming language It may be necessary to support them in IDE: code highlighting, autocomplete, refactorings It may be necessary to transform them: migration of legacy software to new platforms It may be necessary to detect vulnerabilities in such code Any other problems of programming languages can occur Ekaterina Verbitskaia (SPbSU) 26/08/2015 3 / 27
  4. 4. Static analysis of string-embedded code Performed without programm execution Checks that the set of properties holds for each possible expression value Undecidable for string-embedded code in the general case The set of possible expression values is over approximated and then the approximation is analysed Ekaterina Verbitskaia (SPbSU) 26/08/2015 4 / 27
  5. 5. Existing tools PHP String Analyzer, Java String Analyzer, Alvor Static analyzers for PHP, Java, and SQL embedded into Java respectively Kyung-Goo Doh et al. Checks syntactical correctness of embedded code PHPStorm IDE for PHP with support of HTML, CSS, JavaScript IntelliLang PHPStorm and IDEA plugin, supports various languages STRANGER Vulnerability detection of PHP Flaws Limited functionality Hard to extend them with new features or support new languages Do not create structural representation of code Ekaterina Verbitskaia (SPbSU) 26/08/2015 5 / 27
  6. 6. Static analysis of string-embedded code: the scheme Identification of hotspots: points of interest, where the analysis is desirable Approximation construction Lexical analysis Syntactic analysis Semantic analysis Ekaterina Verbitskaia (SPbSU) 26/08/2015 6 / 27
  7. 7. Static analysis of string-embedded code: the scheme Code: hotspot is marked string res = ""; for(i = 0; i < l; i++) res = "()" + res; use(res); Possible values {"", "()", "()()", ..., "()"^l} Regular approximation ("()")* Approximation Ekaterina Verbitskaia (SPbSU) 26/08/2015 7 / 27
  8. 8. Static analysis of string-embedded code: the scheme Approximation After lexing Grammar start ::= s s ::= LBR s RBR s s ::= 𝜖 Parse forest start_rule prod 2 s ! prod 0 prod 0 LBR s RBR eps LBR s RBR eps Ekaterina Verbitskaia (SPbSU) 26/08/2015 8 / 27
  9. 9. Problem statement The aim is to develop the algorithm suitable for syntactic analysis of string-embedded code Tasks: Develop an algorithm for parsing of regular approximation of embedded code which produce a finite parse forest Parse forest should contain a parse tree for every correct (w.r.t. reference grammar) string accepted by the input automaton Incorrect strings should be omitted: no error detection An algorithm should not depend on the language of the host program and the language of embedded code Ekaterina Verbitskaia (SPbSU) 26/08/2015 9 / 27
  10. 10. Algorithm Input: reference DCF grammar G and DFA graph with no 𝜖-transitions over the alphabeth of terminals of G Output: finite representation of the trees corresponding to all correct string accepted by input automaton Ekaterina Verbitskaia (SPbSU) 26/08/2015 10 / 27
  11. 11. Right-Nulled Generalized LR algorithm RNGLR processes context free grammars In case when LR conflicts occur, parses in each possible way Shift/Reduce conflict Reduce/Reduce conflict Uses data structures which reduce memory consumption and guarantee appropriate time of analysis Ekaterina Verbitskaia (SPbSU) 26/08/2015 11 / 27
  12. 12. RNGLR data structures: Graph-Structured Stack 0 2 a 3 b 9 B 4 A 6 a 8a 10 A 5 a 8 a a 11a 1 S Ekaterina Verbitskaia (SPbSU) 26/08/2015 12 / 27
  13. 13. RNGLR operations: shift 0 2 a 3b 9 B 4 A 6 a 8 a 10 A Ekaterina Verbitskaia (SPbSU) 26/08/2015 13 / 27
  14. 14. RNGLR operations: shift 0 2 a 3b 9 B 4 A 6 a 8 a 10 A 0 2 a 3b 9 B 4 A 6 a 8 a 10 A 5 a 8 a a 11 a Ekaterina Verbitskaia (SPbSU) 26/08/2015 13 / 27
  15. 15. RNGLR operations: reduce 0 2 a 3b 9 B 4 A 6 a 8 a 10 A 5 a 8 a a 11 a Ekaterina Verbitskaia (SPbSU) 26/08/2015 14 / 27
  16. 16. RNGLR operations: reduce 0 2 a 3b 9 B 4 A 6 a 8 a 10 A 5 a 8 a a 11 a 0 2 a 3 b 9 B 4 A 6 a 8a 10 A 5 a 8 a a 11a 1 S Ekaterina Verbitskaia (SPbSU) 26/08/2015 14 / 27
  17. 17. RNGLR Read the input sequentially Process all reductions Shift the next token Each time new vertex is added to the GSS, shift is calculated Each time new edge is created in the GSS, reductions are calculated Ekaterina Verbitskaia (SPbSU) 26/08/2015 15 / 27
  18. 18. Algorithm Traverse the automaton graph and sequentially construct GSS, similarly as in RNGLR New type of “conflict”: Shift/Shift The set of LR-states is associated with each vertex of input graph The order in which the vertices of input graph are traversed is controled with a queue. Whenever new edge is added to GSS, its tail vertex is enqueued The algorithm implements relaxed parsing: errors are not detected, erroneous strings are ignored Ekaterina Verbitskaia (SPbSU) 26/08/2015 16 / 27
  19. 19. Cycles processing: initial stack Ekaterina Verbitskaia (SPbSU) 26/08/2015 17 / 27
  20. 20. Cycles processing: some edges were added 5 1 0 2 4 3 ... Ekaterina Verbitskaia (SPbSU) 26/08/2015 18 / 27
  21. 21. Cycles processing: cycle closed 5 1 0 2 4 3 ... Ekaterina Verbitskaia (SPbSU) 26/08/2015 19 / 27
  22. 22. Cycles processing: new reduction 5 1 0 2 4 3 ... Ekaterina Verbitskaia (SPbSU) 26/08/2015 20 / 27
  23. 23. Cycles processing: new reduction added 5 1 0 2 4 3 ... Ekaterina Verbitskaia (SPbSU) 26/08/2015 21 / 27
  24. 24. Parse forest construction Shared Packed Parse Forest — graph, in which all derivation trees are merged Constructed in the same manner as in the RNGLR-algorithm Fragments of derivation trees are associated with GSS edges Each time shift is processed, new tree of one terminal is created Each time reduction is processed, new tree with children accumulated along the paths is created Children are not copied, they are reused The root of resulting tree is associated with GSS edge, corresponding to reduction to the starting nonterminal Unreachable vertices are deleted from resulting graph Ekaterina Verbitskaia (SPbSU) 26/08/2015 22 / 27
  25. 25. Algorithm: correctness Correct tree — derivation tree of some string accumulated along the path in the input graph n start n s t LBR n s t RBR n s eps t LBR n s t RBR eps Ekaterina Verbitskaia (SPbSU) 26/08/2015 23 / 27
  26. 26. Algorithm: correctness Theorem (Termination) Algorithm terminates for any input Theorem (Correctness) Every tree, generated from SPPF, is correct Theorem (Correctness) For every path p in the inner graph, recognized w.r.t. reference grammar, a correct tree corresponding to p can be generated from SPPF Ekaterina Verbitskaia (SPbSU) 26/08/2015 24 / 27
  27. 27. Implementation The algorithm is implemented as a part of YaccConstructor project using F# programming language The generator of RNGLR parse tables and data structures for GSS and SPPF are reused Ekaterina Verbitskaia (SPbSU) 26/08/2015 25 / 27
  28. 28. Evaluation The data is taken from the project of migration from MS-SQL to Oracle Server 2,7 lines of code, 2430 queries, 2188 successfully processed The number of queries which previously could not be processed because of timeout is decreased from 45 to 1 Ekaterina Verbitskaia (SPbSU) 26/08/2015 26 / 27
  29. 29. Conclusion The algorithm for parsing of regular approximation of dynamically generated string which constructs the finite representation of parse forest is developped Its termination and correctness are proved The algorithm is implemented as a part of YaccConstructor project The evaluation demonstrated it could be used for complex tasks Ekaterina Verbitskaia (SPbSU) 26/08/2015 27 / 27

×