Some people, when faced with a problem think,
         “I know, I’ll use regular expressions”.
         Now they have two ...
Example 1



Tuesday, 19 May 2009

I run a film listing site: http://filmli.st. All the data is scraped from other sites - getting the
da...
<span>
               Fri/Sun-Tue 10.45 12.30 (Tue) 12.40 (not Tue)
               4.00 7.00 9.30; Wed 3.00 7.30 9.00
    ...
Example 2



Tuesday, 19 May 2009

Chatroom bots need to be able to distinguish between messages that they should take
actions on and t...
/^\s*whereis\s+(.+?)(?:\s+(?:on\s+)?(.+?))?\s*$/




Tuesday, 19 May 2009

Regular expressions? Pretty confusing.
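
A quick way to see what that pattern captures is to run it against a few messages. The sketch below assumes the whitespace escapes (`\s`) that the pattern needs in order to work; the constant `WHEREIS`, the helper `parse_whereis` and the sample messages are made up for illustration.

```ruby
# Hypothetical constant name. The pattern is the one from the slide,
# written with `\s` for whitespace (an assumption about the original).
WHEREIS = /^\s*whereis\s+(.+?)(?:\s+(?:on\s+)?(.+?))?\s*$/

# Returns [person, day] for a matching message, or nil if it does not match.
def parse_whereis(message)
  match = WHEREIS.match(message)
  match && match.captures
end

p parse_whereis("whereis roland on tuesday") # => ["roland", "tuesday"]
p parse_whereis("whereis roland")            # => ["roland", nil]
p parse_whereis("hello there")               # => nil
```

The captures come back by position only, which is part of what makes the pattern hard to read and maintain.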
whereis <person> [[on] <day>]




Tuesday, 19 May 2009

Much nicer to have a simpler language.
Example 3



Tuesday, 19 May 2009
Scenario: producing human-readable tests
                 Given I have non-technical stakeholders
                 When I ...
Tuesday, 19 May 2009

They have! Cucumber. Cucumber’s implementation got me started looking into...
Tuesday, 19 May 2009

Treetop. A ruby Parsing Expression Grammar. Basically a parser generator, but really simple.
What is a parser?



Tuesday, 19 May 2009

A parser determines whether strings are syntactically valid according to a set ...
Yes / No



Tuesday, 19 May 2009

From a theoretical viewpoint, parsers just say true or false, depending on whether the s...
Syntax Tree



Tuesday, 19 May 2009

Not so useful, so instead we get back a syntax tree we can do useful things with.
whereis <person> [on <day>]




Tuesday, 19 May 2009

Let's try building a tree for this example. You can consider a string...
words          words

                        whereis <person> [on <day>]




Tuesday, 19 May 2009

We have some words...
words   variable   words   variable

                       whereis <person>    [on     <day>]




Tuesday, 19 May 2009

v...
optional part

                        words      variable      words       variable

                       whereis <pers...
expression

                                             optional part

                       words     variable   words ...
grammar Message
               end




Tuesday, 19 May 2009

Let's build that up in Treetop. Each of those four types of no...
grammar Message
                 rule expression
                   (words / variable / optional_part)+
                 e...
$ tt message.treetop




Tuesday, 19 May 2009

We compile the grammar with the command line tt command - you can also load...
require 'message'

               parser = MessageParser.new
               tree = parser.parse("whereis <person>...")



...
Fri/Sun-Tue 4.00 7.00




Tuesday, 19 May 2009

Another example. This time we’ll think about the tree in a top down fashio...
expression




                       Fri/Sun-Tue 4.00 7.00




Tuesday, 19 May 2009
expression

                         days                       times




                       Fri/Sun-Tue              ...
expression

                                 days                             times

                       day        day...
expression

                               days                               times

                       day      day r...
rule expression
                 days " " times
               end




Tuesday, 19 May 2009
rule times
                 time (" " time)+
               end

               rule time
                 hours "." minut...
rule days
                 (day !"-" / day_range) ("/" days)?
               end

               rule day_range
          ...
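
The effect of the `!"-"` lookahead in the days rule can be imitated in plain Ruby. This is only a hand-written sketch using `StringScanner` from the standard library, not what Treetop generates; the method names and the `lookahead:` switch are invented here so that the two behaviours can be compared.

```ruby
require "strscan"

DAY = /Mon|Tue|Wed|Thu|Fri|Sat|Sun/

# day_range <- day "-" day
def parse_day_range(scanner)
  start = scanner.pos
  from = scanner.scan(DAY)
  dash = from && scanner.scan(/-/)
  to = dash && scanner.scan(DAY)
  return [:day_range, from, to] if to

  scanner.pos = start # backtrack on failure
  nil
end

# days <- (day !"-" / day_range) ("/" days)?
# Alternatives are tried in order, left to right, as in a PEG.
def parse_days(scanner, lookahead: true)
  start = scanner.pos
  day = scanner.scan(DAY)
  item =
    if day && !(lookahead && scanner.check(/-/))
      [:day, day]                 # first alternative: day !"-"
    else
      scanner.pos = start         # backtrack, then try day_range
      parse_day_range(scanner)
    end
  return nil unless item

  mark = scanner.pos
  if scanner.scan(%r{/}) && (rest = parse_days(scanner, lookahead: lookahead))
    [item, *rest]
  else
    scanner.pos = mark            # the ("/" days)? part is optional
    [item]
  end
end

with = StringScanner.new("Fri/Sun-Tue")
p parse_days(with)        # => [[:day, "Fri"], [:day_range, "Sun", "Tue"]]

without = StringScanner.new("Sun-Tue")
p parse_days(without, lookahead: false) # => [[:day, "Sun"]]
p without.rest                          # => "-Tue"
```

With the lookahead switched off, the first alternative accepts "Sun" on its own and the leftover "-Tue" can no longer be parsed.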
Enriching Nodes



Tuesday, 19 May 2009

Adding in some semantics
rule time
                 hours "." minutes
               end


               irb> aTimeNode.text_value #=> "9.00"
    ...
rule time
                 hours "." minutes {
                   def to_seconds
                     hours.text_value.to_i * 60 * 60...
# in film_time.treetop
               rule time
                 hours "." minutes <TimeNode>
               end

        ...
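
The arithmetic behind `to_seconds` is easy to check without Treetop. In this sketch `TimeValue` is a hypothetical stand-in for a time node: it only mimics the two labelled children (`hours`, `minutes`) and uses a small pattern in place of the grammar rules, so it illustrates the calculation rather than the Treetop API.

```ruby
# Hypothetical stand-in for a Treetop time node (not part of Treetop).
TimeValue = Struct.new(:hours, :minutes) do
  # Accepts strings such as "9.00" or "12.30"; returns nil otherwise.
  def self.parse(text)
    match = /\A(1[0-2]|[0-9])\.([0-5][0-9])\z/.match(text)
    match && new(match[1], match[2])
  end

  # Hours and minutes are kept as text, so convert before multiplying.
  def to_seconds
    hours.to_i * 60 * 60 + minutes.to_i * 60
  end
end

p TimeValue.parse("9.00").to_seconds  # => 32400
p TimeValue.parse("12.30").to_seconds # => 45000
p TimeValue.parse("25.00")            # => nil
```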
Interpretation &
                         Compilation



Tuesday, 19 May 2009

We’re going to build up a regular expressio...
expression

                                            optional part

                        words   variable   words   ...
Interpreter Pattern



Tuesday, 19 May 2009

This is confusing - it comes from GoF. Actually we’re doing compilation here....
# expression
               def interpret
                 children = elements.map {|node| node.interpret }
              ...
# words
               def interpret
                 Regexp.escape(text_value)
               end




Tuesday, 19 May 2009
# variable
               def interpret
                 "(.+?)"
               end




Tuesday, 19 May 2009
# optional_part
               def interpret
                 children = elements.map {|node| node.interpret }
           ...
Adding context



Tuesday, 19 May 2009

For anything more than a simple language, you’ll need to pass around context as yo...
# expression
               def interpret(context=[])
                 children = elements.map do |node|
                 ...
# variable
               def interpret(context)
                 context << identifier.text_value.to_sym
                ...
Other Options



Tuesday, 19 May 2009

You can also build external interpreters / compilers that use the tree
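
As a rough sketch of the external approach, the compiler below lives outside the node classes and walks a tree made of plain Structs. The node types (`Words`, `Variable`, `Optional`, `Expression`) and the `compile` helpers are hypothetical stand-ins for the Treetop nodes, and the tree is built by hand rather than parsed.

```ruby
# Hypothetical node types standing in for the Treetop syntax nodes.
Words      = Struct.new(:text)
Variable   = Struct.new(:name)
Optional   = Struct.new(:children)
Expression = Struct.new(:children)

# Turns one node into a fragment of regexp source, recording variable
# names in `context` as they are encountered.
def compile_node(node, context)
  case node
  when Words
    Regexp.escape(node.text)
  when Variable
    context << node.name.to_sym
    "(.+?)"
  when Optional
    '(?:\s+' + node.children.map { |c| compile_node(c, context) }.join + ")?"
  when Expression
    node.children.map { |c| compile_node(c, context) }.join
  end
end

# Returns the compiled Regexp together with the list of variable names.
def compile(expression)
  context = []
  source = compile_node(expression, context)
  [Regexp.new("^" + source + "$"), context]
end

# Hand-built tree for: whereis <person> [on <day>]
tree = Expression.new([
  Words.new("whereis "),
  Variable.new("person"),
  Optional.new([Words.new("on "), Variable.new("day")])
])

matcher, variables = compile(tree)
match = matcher.match("whereis roland on tuesday")
p variables                           # => [:person, :day]
p variables.zip(match.captures).to_h  # => {:person=>"roland", :day=>"tuesday"} (newer Rubies print person: "roland", ...)
```

Keeping the compiler separate means the node classes stay free of output-specific code, at the cost of a `case` that has to know every node type.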
Complications?



Tuesday, 19 May 2009
# We want to write:
               hello [world]

               # We actually mean:
               hello[ world]




Tues...
# We should optimize:
               hello [[[world]]]

               # To this:
               hello [world]




Tuesday...
# Left recursion without consuming input BAD:
               rule infinity_and_beyond
                 infinity_and_beyond...
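
Why this is a problem can be shown with a hand-written analogue in plain Ruby. The method below is an illustrative sketch, not Treetop output: like the rule, it calls itself at the same input position before consuming anything, so it never makes progress.

```ruby
# Hypothetical analogue of a rule whose first alternative is itself.
# It recurses at the same position without consuming any input.
def left_recursive(input, pos)
  left_recursive(input, pos) || (input[pos, 3] == "foo" ? pos + 3 : nil)
end

overflowed =
  begin
    left_recursive("foo", 0)
    false
  rescue SystemStackError
    true # the recursion never bottoms out
  end

puts "left-recursive rule overflowed the stack: #{overflowed}"
```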
Problems?



Tuesday, 19 May 2009

Slow.
Other libraries



Tuesday, 19 May 2009

Racc - accepts yacc grammars. Racc runtime is part of the ruby std dist. so once ...
Thanks!

         Twitter: @knaveofdiamonds

         XMPP bot:
         http://github.com/knaveofdiamonds/harken

       ...

Treetop - I'd rather have one problem


Talk given at LRUG, May 2009, about Treetop, a Ruby parsing expression grammar. It should hopefully convince you that parsers fit better than regular expressions in quite a few cases.

Published in: Technology, Education

Treetop - I'd rather have one problem

  1. 1. Some people, when faced with a problem think, “I know, I’ll use regular expressions”. Now they have two problems. I’d rather have one problem. Treetop • Roland Swingler • LRUG May 2009 Tuesday, 19 May 2009 This quotation is used a lot in presentations, normally before the presenter delves into some gnarly regexps. I’m looking for a better way.
  2. 2. Example 1 Tuesday, 19 May 2009
  3. 3. Tuesday, 19 May 2009 I run a film listing site: http://filmli.st. All the data is scraped from other sites - getting the data is easy with net/http or httparty or similar and then parsing the html with nokogiri or hpricot, but...
  4. 4. <span> Fri/Sun-Tue 10.45 12.30 (Tue) 12.40 (not Tue) 4.00 7.00 9.30; Wed 3.00 7.30 9.00 </span> Tuesday, 19 May 2009 ... you still need to turn a text string like this into a list of Times so you can do interesting things with it. Regexps? No. That way lies madness.
  5. 5. Example 2 Tuesday, 19 May 2009
  6. 6. Tuesday, 19 May 2009 Chatroom bots need to be able to distinguish between messages that they should take actions on and those which they should ignore. How should we define what messages they should listen out for?
  7. 7. /^\s*whereis\s+(.+?)(?:\s+(?:on\s+)?(.+?))?\s*$/ Tuesday, 19 May 2009 Regular expressions? Pretty confusing.
  8. 8. whereis <person> [[on] <day>] Tuesday, 19 May 2009 Much nicer to have a simpler language.
  9. 9. Example 3 Tuesday, 19 May 2009
  10. 10. Scenario: producing human-readable tests Given I have non-technical stakeholders When I write some integration tests Then they should be understandable by everyone Tuesday, 19 May 2009 Wouldn’t it be great if someone had written a library like this?
  11. 11. Tuesday, 19 May 2009 They have! Cucumber. Cucumber’s implementation got me started looking into...
  12. 12. Tuesday, 19 May 2009 Treetop. A ruby Parsing Expression Grammar. Basically a parser generator, but really simple.
  13. 13. What is a parser? Tuesday, 19 May 2009 A parser determines whether strings are syntactically valid according to a set of rules known as a grammar.
  14. 14. Yes / No Tuesday, 19 May 2009 From a theoretical viewpoint, parsers just say true or false, depending on whether the string is valid or not.
  15. 15. Syntax Tree Tuesday, 19 May 2009 Not so useful, so instead we get back a syntax tree we can do useful things with.
  16. 16. whereis <person> [on <day>] Tuesday, 19 May 2009 Let's try building a tree for this example. You can consider a string to be a list of characters, but to start getting meaning from it, you need a tree.
  17. 17. words words whereis <person> [on <day>] Tuesday, 19 May 2009 We have some words...
  18. 18. words variable words variable whereis <person> [on <day>] Tuesday, 19 May 2009 variables...
  19. 19. optional part words variable words variable whereis <person> [on <day>] Tuesday, 19 May 2009 an optional part of an expression (enclosed with square brackets)
  20. 20. expression optional part words variable words variable whereis <person> [on <day>] Tuesday, 19 May 2009 and a root node for the whole expression
  21. 21. grammar Message end Tuesday, 19 May 2009 Let's build that up in Treetop. Each of those four types of node in the tree is going to have a rule. We write these rules in a grammar - you can think of it like a Ruby module.
  22. 22. grammar Message rule expression (words / variable / optional_part)+ end end Tuesday, 19 May 2009 The first rule for the whole expression. Lots of things should be familiar from regular expressions - ‘+’ for one or more, brackets for grouping, and ‘/’ is like the regexp ‘|’ for alternation. So this says an expression is one or more words, variables or optional parts, in any order.
  23. 23. grammar Message rule expression (words / variable / optional_part)+ end rule words [^><\[\]]+ end end Tuesday, 19 May 2009 words - character classes, just like regexps
  24. 24. grammar Message rule expression (words / variable / optional_part)+ end rule words [^><\[\]]+ end rule variable '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9 ]* ) '>' end end Tuesday, 19 May 2009 variables are enclosed with angle brackets, can be any valid ruby identifier string, and are labeled so we can use part of the text later.
  25. 25. grammar Message rule expression (words / variable / optional_part)+ end rule words [^><\[\]]+ end rule variable '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9 ]* ) '>' end rule optional_part "[" expression "]" end end Tuesday, 19 May 2009 optional parts are enclosed with square brackets. Here we see that rules can be recursive - which makes the parser significantly more powerful than regular expressions.
  26. 26. $ tt message.treetop Tuesday, 19 May 2009 We compile the grammar with the command line tt command - you can also load grammars dynamically
  27. 27. require 'message' parser = MessageParser.new tree = parser.parse("whereis <person>...") Tuesday, 19 May 2009 this gives us a parser we can call from ruby code
  28. 28. require 'message' parser = MessageParser.new tree = parser.parse("whereis <person>...") tree.elements[0].text_value #=> "whereis " tree.elements[1].identifier.text_value #=> "person" Tuesday, 19 May 2009 each node knows about its children and its text_value. The label we defined earlier provides sugar methods to access particular subnodes.
  29. 29. Fri/Sun-Tue 4.00 7.00 Tuesday, 19 May 2009 Another example. This time we’ll think about the tree in a top down fashion rather than bottom up. This is closer to how treetop will actually evaluate an expression.
  30. 30. expression Fri/Sun-Tue 4.00 7.00 Tuesday, 19 May 2009
  31. 31. expression days times Fri/Sun-Tue 4.00 7.00 Tuesday, 19 May 2009
  32. 32. expression days times day day range time time Fri / Sun-Tue 4.00 7.00 Tuesday, 19 May 2009
  33. 33. expression days times day day range time time day day hrs mins hrs mins Fri / Sun - Tue 4 . 00 7 . 00 Tuesday, 19 May 2009
  34. 34. rule expression days " " times end Tuesday, 19 May 2009
  35. 35. rule times time (" " time)+ end rule time hours "." minutes end rule hours "1" [0-2] / [0-9] end rule minutes [0-5] [0-9] end Tuesday, 19 May 2009
  36. 36. rule days (day !"-" / day_range) ("/" days)? end rule day_range day "-" day end rule day "Mon"/"Tue"/"Wed"/"Thu"/"Fri"/"Sat"/"Sun" end Tuesday, 19 May 2009 The !"-" after day (highlighted in red on the slide) is a negative lookahead assertion. We need this because treetop evaluates alternatives from left to right - if we didn’t have the assertion then Sun-Tue would match Sun as a Day, not a DayRange, and we’d be left with “-Tue” which isn’t valid.
  37. 37. Enriching Nodes Tuesday, 19 May 2009 Adding in some semantics
  38. 38. rule time hours "." minutes end irb> aTimeNode.text_value #=> "9.00" irb> aTimeNode.elements.size #=> 3 irb> aTimeNode.hours.text_value #=> "9" Tuesday, 19 May 2009
  39. 39. rule time hours "." minutes { def to_seconds hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60 end } end irb> aTimeNode.text_value #=> "9.00" irb> aTimeNode.to_seconds #=> 32400 Tuesday, 19 May 2009 We can add in methods inline in the grammar. This is just like a module scope, and we can do any ruby we like in here.
  40. 40. # in film_time.treetop rule time hours "." minutes <TimeNode> end # in another .rb file class TimeNode < Treetop::Runtime::SyntaxNode def to_seconds hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60 end end Tuesday, 19 May 2009 Cleaner in my mind to split these out into actual subclasses of SyntaxNode - keeps the grammar more readable. In some cases you need to have modules rather than subclasses.
  41. 41. Interpretation & Compilation Tuesday, 19 May 2009 We’re going to build up a regular expression for the bot example. Each node will be responsible for building a different part of the regexp.
  42. 42. expression optional part words variable words variable whereis <person> [on <day>] /^whereis (.+?)(?:\s+on (.+?))?$/ Tuesday, 19 May 2009
  43. 43. expression optional part words variable words variable whereis <person> [on <day>] /^whereis (.+?)(?:\s+on (.+?))?$/ Tuesday, 19 May 2009
  44. 44. expression optional part words variable words variable whereis <person> [on <day>] /^whereis (.+?)(?:\s+on (.+?))?$/ Tuesday, 19 May 2009
  45. 45. expression optional part words variable words variable whereis <person> [on <day>] /^whereis (.+?)(?:\s+on (.+?))?$/ Tuesday, 19 May 2009
  46. 46. expression optional part words variable words variable whereis <person> [on <day>] /^whereis (.+?)(?:\s+on (.+?))?$/ Tuesday, 19 May 2009
  47. 47. Interpreter Pattern Tuesday, 19 May 2009 This is confusing - it comes from GoF. Actually we’re doing compilation here. Each node gets an interpret method - you treat the syntax tree as a composite.
  48. 48. # expression def interpret children = elements.map {|node| node.interpret } Regexp.compile("^" + children.join + "$") end Tuesday, 19 May 2009
  49. 49. # words def interpret Regexp.escape(text_value) end Tuesday, 19 May 2009
  50. 50. # variable def interpret "(.+?)" end Tuesday, 19 May 2009
  51. 51. # optional_part def interpret children = elements.map {|node| node.interpret } '(?:\s+' + children.join + ')?' end Tuesday, 19 May 2009
  52. 52. Adding context Tuesday, 19 May 2009 For anything more than a simple language, you’ll need to pass around context as you interpret the tree.
  53. 53. # expression def interpret(context=[]) children = elements.map do |node| node.interpret(context) end matcher = Regexp.new("^" + children.join + "$") ... Tuesday, 19 May 2009 In our case we just want to record the list of variable names, so an Array will suffice. Each interpret method now needs to take this context.
  54. 54. # variable def interpret(context) context << identifier.text_value.to_sym "(.+?)" end Tuesday, 19 May 2009
  55. 55. # expression def interpret(context=[]) children = elements.map do |node| node.interpret(context) end matcher = Regexp.new("^" + children.join + "$") matcher.define_singleton_method(:variables) { context } matcher end Tuesday, 19 May 2009 We decorate the regular expression with a list of the variables. In the real code, the returned match objects are also decorated so you have methods for each variable and don’t have to remember the captured groups by position.
  56. 56. Other Options Tuesday, 19 May 2009 You can also build external interpreters / compilers that use the tree
  57. 57. Complications? Tuesday, 19 May 2009
  58. 58. # We want to write: hello [world] # We actually mean: hello[ world] Tuesday, 19 May 2009 Whitespace shuffling. In the real code, the grammar is more complicated - most of the complication comes from dealing with edge cases here.
  59. 59. # We should optimize: hello [[[world]]] # To this: hello [world] Tuesday, 19 May 2009 This isn’t done in the real code, but should be.
  60. 60. # Left recursion without consuming input BAD: rule infinity_and_beyond infinity_and_beyond / "foo" end Tuesday, 19 May 2009
  61. 61. Problems? Tuesday, 19 May 2009 Slow.
  62. 62. Other libraries Tuesday, 19 May 2009 Racc - accepts yacc grammars. The Racc runtime is part of the Ruby standard distribution, so once you’ve built your parser there is no dependency. Ragel - used by Mongrel/Thin.
  63. 63. Thanks! Twitter: @knaveofdiamonds XMPP bot: http://github.com/knaveofdiamonds/harken Film listings for London’s indie cinemas: http://filmli.st Treetop: http://github.com/nathansobo/treetop http://treetop.rubyforge.org Tuesday, 19 May 2009