This document summarizes a talk about using parsers to extract structured data from unstructured text. It introduces Treetop, a Ruby library for building parsers based on Parsing Expression Grammars (PEGs). It provides examples of using Treetop grammars to parse configuration files and discusses techniques like using Ruby code in grammars and handling nested structures. The document notes that Treetop parsers are fast and linear time but can be memory intensive for large inputs. It recommends Treetop and links to further reading on PEGs and the Treetop library.
2. Say what? Why do I need a Parser?
Argentina.blft # real-world example: Big Lever Gears Format
propertyUrlPrefix = "propiedad";
landingPageUrlPrefix = "alquileres-vacaciones";
propertyReviewsPrefix = "reviews";
propertyReviewsWritePrefix = "reviews/write";
propertyReviewsConfirmPrefix = "reviews/confirm";
propertyReviewsResponsePrefix = "reviews/response";
6. Ok… maybe not
# this just extracts the values, we haven’t even
# begun to set the correct type or handle the
# nesting
/([a-zA-Z0-9])+=([a-zA-Z]+[^;]*|'"'[^"]*'"'|[0-9]+|(true|false))/
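To make the limitation concrete, here is a minimal Ruby sketch of the regex-only approach. It is an illustration, not the code from the talk: the `PAIR` pattern is a simplified stand-in for the expression above, `extract_pairs` is a hypothetical helper, and the `maxResults` line is an invented sample entry added only to show an unquoted value.

```ruby
# Sample input in the style of Argentina.blft.
# The maxResults line is hypothetical, added to show a non-string value.
SAMPLE = <<~BLFT
  propertyUrlPrefix = "propiedad";
  propertyReviewsWritePrefix = "reviews/write";
  maxResults = 10;
BLFT

# Simplified pattern (an assumption, not the slide's regex):
# a key, '=', then a quoted string, an integer, or a boolean, ended by ';'.
PAIR = /^\s*([A-Za-z0-9]+)\s*=\s*("[^"]*"|\d+|true|false)\s*;/

# Returns a Hash of key => raw matched text. Nothing is converted:
# strings keep their quotes and numbers stay strings.
def extract_pairs(text)
  text.scan(PAIR).to_h
end

pairs = extract_pairs(SAMPLE)
pairs.each { |key, raw| puts "#{key} -> #{raw.inspect} (#{raw.class})" }
```

Every value comes back as a `String`, quotes included, so type conversion and any nesting would still need a second pass on top of the pattern.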
23. On Parsing Expression Grammars
Parsing expression grammars (PEGs) are an alternative to
context free grammars for formally specifying syntax, and
packrat parsers are parsers for PEGs that operate in guaranteed
linear time through the use of memoization.
• Linear time, fast!
• Memory hog: the memoization table needs storage proportional to the total input size
• Not suitable for natural language processing
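The trade-off between linear time and memory can be sketched with a tiny hand-rolled packrat recognizer in plain Ruby. This is not how Treetop is implemented; the class name `TinyPackrat`, its rule names, and the toy grammar are all assumptions made for illustration. Each (rule, position) pair is evaluated at most once and its result cached, which is where both the speed and the memory cost come from.

```ruby
# Toy grammar (hypothetical, for illustration only):
#   file       <- (assignment "\n")*
#   assignment <- ident ws '=' ws value ws ';'
#   value      <- string / integer
class TinyPackrat
  attr_reader :memo

  def initialize(input)
    @input = input
    @memo = {} # [rule, position] => end position, or nil on failure
  end

  # True when the whole input matches the file rule.
  def parse
    apply(:file, 0) == @input.length
  end

  # Memoized rule application: each (rule, pos) pair runs at most once.
  def apply(rule, pos)
    key = [rule, pos]
    return @memo[key] if @memo.key?(key)
    @memo[key] = send(rule, pos)
  end

  # Match a \G-anchored regexp at pos; return the end position or nil.
  def match(regexp, pos)
    m = regexp.match(@input, pos)
    m && m.end(0)
  end

  def ident(pos)
    match(/\G[A-Za-z][A-Za-z0-9]*/, pos)
  end

  def ws(pos)
    match(/\G[ \t]*/, pos)
  end

  def string(pos)
    match(/\G"[^"]*"/, pos)
  end

  def integer(pos)
    match(/\G\d+/, pos)
  end

  # Ordered choice: the first alternative that succeeds wins.
  def value(pos)
    apply(:string, pos) || apply(:integer, pos)
  end

  def assignment(pos)
    pos = apply(:ident, pos) or return nil
    pos = apply(:ws, pos)
    pos = match(/\G=/, pos) or return nil
    pos = apply(:ws, pos)
    pos = apply(:value, pos) or return nil
    pos = apply(:ws, pos)
    match(/\G;/, pos)
  end

  # Zero or more assignments, each followed by a newline.
  def file(pos)
    loop do
      nxt = apply(:assignment, pos)
      nxt &&= match(/\G\n/, nxt)
      return pos unless nxt
      pos = nxt
    end
  end
end

small = TinyPackrat.new(%(a = "x";\n))
large = TinyPackrat.new(%(a = "x";\n) * 10)
puts small.parse
puts large.parse
puts "memo entries: #{small.memo.size} vs #{large.memo.size}"
```

The memo table grows with the number of positions visited, so a longer input leaves more cached entries behind. That is the "memory hog" side of the guarantee that no rule is ever re-evaluated at the same position.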