The Art Of Parsing @ Devoxx France 2014

What has attracted researchers from the 60s until today? What do computer science engineers study at university and then successfully forget? What is at the heart of the compilers used daily by every software developer? Parsers! Taking a practical point of view with a small pill of theory, this session will shed light on questions like: if there are so many parser generators based on formal theory, then why are javac, GCC and Clang all hand-written? And how do we, insiders of the world of parsing, do this at SonarSource for languages like Java, C/C++, C#, JavaScript, Python, COBOL?

The Art Of Parsing @ Devoxx France 2014 Presentation Transcript

  • 1. The Art of Parsing. Evgeny Mandrikov (@_godin_), Dinesh Bolkensteyn (@dbolkensteyn). #parsing http://sonarsource.com
  • 2. The Art of Parsing. // TODO: don't forget to add huge disclaimer that all opinions hereinbelow are our own and not our employer's (they wish they had them)
  • 3. I want to create a parser. «Done»! Use Yacc, JavaCC, ANTLR, SSLR, … or hand-written?
  • 4. What is the plan? Why • are javac and GCC hand-written • do we use parser generators? Together we will implement a parser for • arithmetic expressions • common constructions from Java • C++ ;)
  • 5. Java formal grammar: JLS8, JLS7
  • 6. Answer is 42
  • 7. Pill of theory: NUM ➙ 42 is a production; NUM is a nonterminal; 42 is a terminal (token).
  • 8. Grammar for numbers
    NUM ➙ NUM DIGIT | DIGIT
    DIGIT ➙ 0|1|2|3|4|5|6|7|8|9
    «|» separates alternatives. Matches 4, 8, 15, 16, 23, 42, …
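
The number grammar on slide 8 can be read as "a NUM is a shorter NUM followed by one more DIGIT". A minimal Java sketch of that reading follows; the class name `NumberGrammar` and the method `parseNum` are made up for illustration and do not come from the talk.

```java
final class NumberGrammar {
    // DIGIT -> 0|1|2|3|4|5|6|7|8|9
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // NUM -> NUM DIGIT | DIGIT, unrolled into a loop:
    // every further DIGIT extends the NUM recognized so far.
    static int parseNum(String input) {
        if (input.isEmpty()) {
            throw new IllegalArgumentException("expected at least one digit");
        }
        int value = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!isDigit(c)) {
                throw new IllegalArgumentException("unexpected character at " + i);
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }
}
```

Calling `NumberGrammar.parseNum("42")` walks the digits left to right, which mirrors how the left-recursive rule nests: the NUM for "4" sits inside the NUM for "42".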
  • 9. Arithmetic expressions: 4 - 3 - 2 = ?
  • 10. Arithmetic expressions: 4 - 3 - 2 = ?
    expr ➙ expr - expr | NUM
  • 11. Arithmetic expressions
    expr ➙ expr - expr | NUM
    [parse tree] (4 - 3) - 2 = -1
  • 12. Arithmetic expressions
    expr ➙ expr - expr | NUM
    [two parse trees] (4 - 3) - 2 = -1 and 4 - (3 - 2) = 3
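
To make the ambiguity of `expr ➙ expr - expr | NUM` concrete, the following Java sketch enumerates every parse tree by brute force and collects the value each one yields. It is an illustration only, not a technique from the talk; `AmbiguousSubtraction` and `allValues` are hypothetical names, and the input is assumed to be already split into numbers.

```java
import java.util.Set;
import java.util.TreeSet;

final class AmbiguousSubtraction {
    // Returns every value that expr -> expr - expr | NUM can assign
    // to the numbers nums[from..to], joined by '-'.
    static Set<Integer> allValues(int[] nums, int from, int to) {
        Set<Integer> results = new TreeSet<>();
        if (from == to) {                 // expr -> NUM
            results.add(nums[from]);
            return results;
        }
        // expr -> expr - expr: try every '-' as the root of the tree.
        for (int split = from; split < to; split++) {
            for (int left : allValues(nums, from, split)) {
                for (int right : allValues(nums, split + 1, to)) {
                    results.add(left - right);
                }
            }
        }
        return results;
    }
}
```

For the numbers 4, 3, 2 the result set contains both -1 and 3, one per parse tree shown on slide 12.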
  • 13. Arithmetic expressions
    expr ➙ expr - expr | NUM: ambiguous, either (4 - 3) - 2 = -1 or 4 - (3 - 2) = 3
    expr ➙ NUM - expr | NUM: right-recursive, 4 - (3 - 2) = 3
  • 14. Arithmetic expressions
    expr ➙ expr - expr | NUM: ambiguous
    expr ➙ NUM - expr | NUM: right-recursive, 4 - (3 - 2) = 3
    expr ➙ expr - NUM | NUM: left-recursive, (4 - 3) - 2 = -1
  • 15. Show me the code
    expr ➙ expr - NUM | NUM
    int expr() {
      int res = expr();
      if (token == '-') return res - num();
      return num();
    }
  • 16. Show me the code: right code ??
    expr ➙ expr - NUM | NUM
    int expr() {
      int res = expr();
      if (token == '-') return res - num();
      return num();
    }
  • 17. Show me the code: right code
    expr ➙ expr - NUM | NUM
    // naive translation: expr() calls itself first and never terminates
    int expr() {
      int res = expr();
      if (token == '-') return res - num();
      return num();
    }
    // right code: the left recursion becomes a loop
    int expr() {
      int res = num();
      while (token == '-') res = res - num();
      return res;
    }
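
The slide code leaves tokenizing and token consumption implicit. A self-contained Java sketch of the same loop, under the assumption that the input is a plain string of digits, spaces and '-', could look as follows; `SubtractionParser` and `eval` are illustrative names.

```java
final class SubtractionParser {
    private final String input;
    private int pos = 0;

    private SubtractionParser(String input) {
        this.input = input.replace(" ", "");   // assumption: spaces carry no meaning
    }

    static int eval(String input) {
        SubtractionParser p = new SubtractionParser(input);
        int result = p.expr();
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("trailing input at " + p.pos);
        }
        return result;
    }

    // expr -> expr - NUM | NUM, with the left recursion turned into a loop
    private int expr() {
        int res = num();
        while (pos < input.length() && input.charAt(pos) == '-') {
            pos++;                             // consume '-'
            res = res - num();
        }
        return res;
    }

    private int num() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a number at " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }
}
```

`SubtractionParser.eval("4 - 3 - 2")` accumulates from the left, so it computes (4 - 3) - 2.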
  • 18. Arithmetic expressions: 4 - 3 * 2 = ?
  • 19. Arithmetic expressions: 4 - 3 * 2 = -2
    expr ➙ expr - NUM | expr * NUM | NUM
  • 20. Arithmetic expressions: 4 - (3 * 2) = -2, (4 - 3) * 2 = 2
    expr ➙ expr - NUM | expr * NUM | NUM
  • 21. Arithmetic expressions: 4 - (3 * 2) = -2
    subs ➙ subs - mult | mult
    mult ➙ mult * NUM | NUM
  • 22. Show me the code
    subs ➙ subs - mult | mult
    mult ➙ mult * NUM | NUM
    int subs() {
      int res = mult();
      while (token == '-') res = res - mult();
      return res;
    }
    int mult() {
      int res = num();
      while (token == '*') res = res * num();
      return res;
    }
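
The two-level grammar gives '*' a higher precedence than '-' simply because `subs` calls `mult` for its operands. A runnable Java sketch under the same assumptions as before (string input, spaces ignored) is shown here; `PrecedenceParser` is a hypothetical name.

```java
final class PrecedenceParser {
    private final String input;
    private int pos = 0;

    private PrecedenceParser(String input) {
        this.input = input.replace(" ", "");
    }

    static int eval(String input) {
        PrecedenceParser p = new PrecedenceParser(input);
        int result = p.subs();
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("trailing input at " + p.pos);
        }
        return result;
    }

    // One character of lookahead, without consuming it.
    private boolean at(char c) {
        return pos < input.length() && input.charAt(pos) == c;
    }

    // subs -> subs - mult | mult
    private int subs() {
        int res = mult();
        while (at('-')) {
            pos++;
            res -= mult();
        }
        return res;
    }

    // mult -> mult * NUM | NUM
    private int mult() {
        int res = num();
        while (at('*')) {
            pos++;
            res *= num();
        }
        return res;
    }

    private int num() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a number at " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }
}
```

With this layering, `PrecedenceParser.eval("4 - 3 * 2")` multiplies first and subtracts second.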
  • 23. LL(1) ● back to 1969 ● one token lookahead ● no left-recursion
  • 24. What is the plan? ✔ arithmetic expressions ✔ LL(1) • a few common constructions from Java • C++ ;)
  • 25. The real deal
    expr-stmt ➙ expr ;
    Example input: obj.method(); a = obj.field;
  • 26. The real deal
    expr-stmt ➙ expr ;
    expr ➙ field-access | method-call | assignment
    Example input: obj.method(); a = obj.field;
  • 27. The real deal
    expr-stmt ➙ expr ;
    expr ➙ field-access | method-call | assignment
    field-access ➙ qualified-id
    qualified-id ➙ qualified-id . id | id
    Example input: obj.method(); a = obj.field;
  • 28. The real deal
    expr-stmt ➙ expr ;
    expr ➙ field-access | method-call | assignment
    field-access ➙ qualified-id
    qualified-id ➙ qualified-id . id | id
    method-call ➙ qualified-id ()
    Example input: obj.method(); a = obj.field;
  • 29. The real deal
    expr-stmt ➙ expr ;
    expr ➙ field-access | method-call | assignment
    field-access ➙ qualified-id
    qualified-id ➙ qualified-id . id | id
    method-call ➙ qualified-id ()
    assignment ➙ qualified-id = expr
    Example input: obj.method(); a = obj.field;
  • 30. Show me the code
    expr-stmt ➙ expr ;
    expr ➙ field-access | method-call | assignment
    field-access ➙ qualified-id
    qualified-id ➙ qualified-id . id | id
    method-call ➙ qualified-id ()
    assignment ➙ qualified-id = expr
    Example input: obj.method(); a = obj.field;
    int qualified_id() { /* easy */ }
    int field_access() { /* easy */ }
    int method_call() { /* easy */ }
    int assignment() { /* easy */ }
    int expr() {
      // ???
    }
  • 31. The LL(1) way
    expr ➙ field-access | method-call | assignment
    Example input: obj.method(); a = obj.field;
    int expr() {
      String id = qualified_id();
      if (token == '(') return method_call();
      else if (token == '=') return assignment();
      else return field_access();
    }
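
All three alternatives start with a qualified-id, so an LL(1) parser has to read that shared prefix once and only then decide, from the next token, which alternative it is in. A minimal Java sketch of this idea follows; the class `Ll1ExprParser` and the string form of its results (such as `call(...)`) are assumptions made for the example, and identifiers are recognized loosely.

```java
final class Ll1ExprParser {
    private final String input;
    private int pos = 0;

    private Ll1ExprParser(String input) {
        this.input = input.replace(" ", "");
    }

    static String parse(String input) {
        Ll1ExprParser p = new Ll1ExprParser(input);
        String result = p.expr();
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("trailing input at " + p.pos);
        }
        return result;
    }

    private boolean at(char c) {
        return pos < input.length() && input.charAt(pos) == c;
    }

    private void expect(char c) {
        if (!at(c)) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
        pos++;
    }

    // expr: read the shared qualified-id prefix once, then let one
    // token of lookahead choose between the three alternatives.
    private String expr() {
        String id = qualifiedId();
        if (at('(')) {                 // method-call -> qualified-id ()
            pos++;
            expect(')');
            return "call(" + id + ")";
        } else if (at('=')) {          // assignment -> qualified-id = expr
            pos++;
            return "assign(" + id + ", " + expr() + ")";
        } else {                       // field-access -> qualified-id
            return "field(" + id + ")";
        }
    }

    // qualified-id -> qualified-id . id | id, written as a loop
    private String qualifiedId() {
        StringBuilder name = new StringBuilder(id());
        while (at('.')) {
            pos++;
            name.append('.').append(id());
        }
        return name.toString();
    }

    private String id() {
        int start = pos;
        while (pos < input.length() && Character.isJavaIdentifierPart(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected an identifier at " + pos);
        }
        return input.substring(start, pos);
    }
}
```

The price of this approach is visible in `expr()`: the grammar's three separate rules have been merged by hand into one method.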
  • 32. Reality: http://hg.openjdk.java.net/jdk8/jdk8/langtools/.../JavacParser.java
  • 33. The better way
    expr ➙ field-access | method-call | assignment
    Example input: obj.method(); a = obj.field;
    int expr() {
      try {
        return field_access();
      } catch (RE e1) {
        try {
          return method_call();
        } catch (RE e2) {
          return assignment();
        }
      }
    }
  • 34. Show me the code: right code
    expr ➙ method-call / assignment / field-access
    Example input: obj.method(); a = obj.field;
    int expr() {
      try {
        return method_call();
      } catch (RE e1) {
        try {
          return assignment();
        } catch (RE e2) {
          return field_access();
        }
      }
    }
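
The try/catch version only works if each failed alternative leaves the input position where it started. The following Java sketch makes that backtracking step explicit; `PegExprParser`, `ParseFailure` and the result strings are hypothetical, and exceptions are used for control flow only because the slides do so.

```java
final class PegExprParser {
    static final class ParseFailure extends RuntimeException {
        ParseFailure(String message) {
            super(message);
        }
    }

    private final String input;
    private int pos = 0;

    private PegExprParser(String input) {
        this.input = input.replace(" ", "");
    }

    static String parse(String input) {
        PegExprParser p = new PegExprParser(input);
        String result = p.expr();
        if (p.pos != p.input.length()) {
            throw new ParseFailure("trailing input at " + p.pos);
        }
        return result;
    }

    // expr -> method-call / assignment / field-access (ordered choice)
    private String expr() {
        int start = pos;
        try {
            return methodCall();
        } catch (ParseFailure e1) {
            pos = start;               // backtrack before the next alternative
            try {
                return assignment();
            } catch (ParseFailure e2) {
                pos = start;
                return fieldAccess();
            }
        }
    }

    private String methodCall() {
        String id = qualifiedId();
        expect('(');
        expect(')');
        return "call(" + id + ")";
    }

    private String assignment() {
        String id = qualifiedId();
        expect('=');
        return "assign(" + id + ", " + expr() + ")";
    }

    private String fieldAccess() {
        return "field(" + qualifiedId() + ")";
    }

    private String qualifiedId() {
        StringBuilder name = new StringBuilder(id());
        while (pos < input.length() && input.charAt(pos) == '.') {
            pos++;
            name.append('.').append(id());
        }
        return name.toString();
    }

    private String id() {
        int start = pos;
        while (pos < input.length() && Character.isJavaIdentifierPart(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new ParseFailure("expected an identifier at " + pos);
        }
        return input.substring(start, pos);
    }

    private void expect(char c) {
        if (pos >= input.length() || input.charAt(pos) != c) {
            throw new ParseFailure("expected '" + c + "' at " + pos);
        }
        pos++;
    }
}
```

The order of the alternatives matters: with field-access first, as on slide 33, `obj.method()` would be accepted as a field access up to `obj.method` and the remaining `()` would be left unparsed.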
  • 35. Parsing Expression Grammars ● 2002 ● ordered choice «/» ● backtracking ● no left-recursion
  • 36. DSL for PEG
    expr ➙ method-call / assignment / field-access
    Example input: obj.method(); a = obj.field;
    enum Nonterminals { EXPR, METHOD_CALL, … }
    void grammar() {
      rule(EXPR).is(
        firstOf(
          METHOD_CALL,
          ASSIGNMENT,
          FIELD_ACCESS));
    }
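
To give an idea of what sits behind a DSL like the one on slide 36, here is a toy set of PEG combinators in Java. This is not SSLR's implementation or API: the `Rule` interface, the convention of returning -1 on failure, and all helper names are assumptions chosen for the sketch, and it only recognizes input without building a tree.

```java
final class PegDsl {
    /** A rule returns the position after a match, or -1 on failure. */
    interface Rule {
        int match(String in, int pos);
    }

    static Rule token(String text) {
        return (in, pos) -> in.startsWith(text, pos) ? pos + text.length() : -1;
    }

    static Rule sequence(Rule... parts) {
        return (in, pos) -> {
            int p = pos;
            for (Rule part : parts) {
                p = part.match(in, p);
                if (p < 0) return -1;
            }
            return p;
        };
    }

    // Ordered choice: every alternative restarts at pos, the first success wins.
    static Rule firstOf(Rule... alternatives) {
        return (in, pos) -> {
            for (Rule alternative : alternatives) {
                int end = alternative.match(in, pos);
                if (end >= 0) return end;
            }
            return -1;
        };
    }

    static Rule zeroOrMore(Rule rule) {
        return (in, pos) -> {
            int p = pos;
            while (true) {
                int next = rule.match(in, p);
                if (next <= p) return p;   // stop on failure (-1) or no progress
                p = next;
            }
        };
    }

    static final Rule ID = (in, pos) -> {
        int p = pos;
        while (p < in.length() && Character.isJavaIdentifierPart(in.charAt(p))) p++;
        return p > pos ? p : -1;
    };

    static final Rule QUALIFIED_ID = sequence(ID, zeroOrMore(sequence(token("."), ID)));
    static final Rule METHOD_CALL = sequence(QUALIFIED_ID, token("("), token(")"));
    static final Rule FIELD_ACCESS = QUALIFIED_ID;
    // ASSIGNMENT refers back to EXPR, so that reference is resolved lazily.
    static final Rule ASSIGNMENT =
            sequence(QUALIFIED_ID, token("="), (in, pos) -> exprRule().match(in, pos));
    static final Rule EXPR = firstOf(METHOD_CALL, ASSIGNMENT, FIELD_ACCESS);

    static Rule exprRule() {
        return EXPR;
    }

    static boolean matches(String input) {
        String text = input.replace(" ", "");
        return EXPR.match(text, 0) == text.length();
    }
}
```

Because each rule is just a function from a position to a position, backtracking comes for free: a failed alternative never mutates shared state, so `firstOf` can simply try the next one from the same position.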
  • 37. What is the plan? ✔ arithmetic expressions ✔ LL(1) ✔ common constructions from Java ✔ PEG • C++ ;)
  • 38. Tea Break
  • 39. Quiz
    if (false) if (true) System.out.println("foo"); else System.out.println("bar");
  • 40. «Dangling else»
    if (false) if (true) System.out.println("foo"); else System.out.println("bar");
    if-stmt ➙ IF (cond) stmt ELSE stmt / IF (cond) stmt
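
In Java the `else` belongs to the nearest `if`, which is also what the ordered choice on slide 40 produces: the inner `if` tries the alternative with ELSE first and takes the `else` before the outer `if` gets a chance. The quiz can be checked with a small Java sketch; `DanglingElse.run` is a hypothetical helper that collects the output in a string instead of printing it.

```java
final class DanglingElse {
    // Same shape as the quiz, with the two conditions as parameters.
    static String run(boolean outer, boolean inner) {
        StringBuilder out = new StringBuilder();
        if (outer)
            if (inner) out.append("foo");
            else out.append("bar");    // belongs to "if (inner)", the nearest if
        return out.toString();
    }
}
```

`DanglingElse.run(false, true)` corresponds to the quiz and produces no output at all, while `run(true, false)` yields "bar".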
  • 41. Java is awesome: (A)*B
  • 42. C++: all the pains of the world. (A)*B
    int *B; typedef int A;
    (A)*B; // cast to type 'A' ('int' alias) of dereference of expression 'B'
    int A, B;
    (A)*B; // multiplication of 'A' and 'B' with redundant parentheses around 'A'
    Java is good, because it was influenced by the bad experience of C++
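
The two C++ snippets differ only in how `A` was declared, so the token sequence `(A)*B` alone cannot tell a parser which reading applies. The Java sketch below only illustrates that dependency on declaration information; it is not an approach proposed in the talk, and `CastOrMultiply.classify` and its result strings are made up for the example.

```java
import java.util.Set;

final class CastOrMultiply {
    // Decides how "(name)*operand" must be read, given the set of
    // names that are currently declared as types (e.g. via typedef).
    static String classify(String name, Set<String> typeNames) {
        return typeNames.contains(name)
                ? "cast to " + name + " of a dereference"
                : "multiplication by " + name;
    }
}
```

With `A` in the set of type names the expression is a cast, and with an empty set it is a multiplication, even though the text being parsed is identical.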
  • 43. Hit the wall! (A)*B
    rule(MUL_EXPR).is(
      UNARY_EXPR,
      zeroOrMore('*', UNARY_EXPR));
    rule(UNARY_EXPR).is(
      firstOf(
        sequence('(', TYPE_ID, ')', UNARY_EXPR),
        PRIMARY,
        sequence('*', UNARY_EXPR)));
    rule(PRIMARY).is(
      firstOf(
        sequence('(', EXPR, ')'),
        ID));
  • 44. Hit the wall! (A)*B
    rule(MUL_EXPR).is(
      UNARY_EXPR,
      zeroOrMore('*', UNARY_EXPR));
    rule(UNARY_EXPR).is(
      firstOf(
        sequence('(', TYPE_ID, ')', UNARY_EXPR),
        PRIMARY,
        sequence('*', UNARY_EXPR)));
    rule(PRIMARY).is(
      firstOf(
        sequence('(', EXPR, ')'),
        ID));
  • 45. Dream: (A)*B
    mul-expr ➙ mul-expr * unary-expr | unary-expr
    unary-expr ➙ ( type-id ) unary-expr | * unary-expr | primary
    primary ➙ ( expr ) | id
  • 46. Generalized parsers ● Earley (1968): slow ● GLR (1984): complex
  • 47. Chicken and egg problem: (A)*B
    [diagram labels: unary-expr, mul-expr; (A), (A)*B, B*...]
    mul-expr ➙ mul-expr * unary-expr | unary-expr
    unary-expr ➙ ( type-id ) unary-expr | * unary-expr | primary
    primary ➙ ( expr ) | id
  • 48. Back to the future: «dangling else»
    if (…) if (…) then-stmt else else-stmt
    [diagram labels: outer-if, inner-if, then-stmt, else-stmt; inner-if · else-stmt]
  • 49. GLL: How does it work? mul-expr ➙ mul-expr * unary-expr | unary-expr
  • 50. Generalized LL ● 2010 ● no grammar left behind (left-recursive, ambiguous) ● simpler than GLR ● syntactic ambiguities
  • 51. Summary
  • 52. Summary: LL(1) • trivial • major grammar changes • only good for arithmetic expressions • on steroids, as in JavaCC, usable for real languages
  • 53. Summary: PEG • trivial • fewer grammar changes • no ambiguities • usable for real languages • nice tools such as SSLR • dead-end for C/C++
  • 54. Summary: GLL • any grammar • relatively simple • ambiguities • reasonable performance • the only clean choice for C/C++ • only «academic» tools for now... ;)
  • 55. Summary: Hand-written ● based on LL(1) ● precise error-reporting and recovery ● best performance ● maintenance hell
  • 56. Q & A