01/30/15 1
Text Processing With Boost
O r, "Anything yo u can do , Ican
do be tte r"
copyright 2006 David Abrahams, Eric Niebler01/30/15 2
Talk Overview
Goal: Become productive C++ string manipulators
with t...
01/30/15 3
Part 1: The Simple Stuff
Utilities for Ad Hoc Text
Manipulation
copyright 2006 David Abrahams, Eric Niebler01/30/15 4
A Legacy of Inadequacy
 Python:
>>> int('123')
123
>>> str(123)
'12...
copyright 2006 David Abrahams, Eric Niebler01/30/15 5
Stringstream: A Better atoi()
{
std::stringstream sout;
std::string ...
copyright 2006 David Abrahams, Eric Niebler01/30/15 6
Boost.Lexical_cast
// Approximate implementation ...
template< typen...
copyright 2006 David Abrahams, Eric Niebler01/30/15 7
Boost.Lexical_cast
int i = lexical_cast<int>( "123" );
std::string s...
copyright 2006 David Abrahams, Eric Niebler01/30/15 8
Boost.String_algo
 Extension to std:: algorithms
 Includes algorit...
copyright 2006 David Abrahams, Eric Niebler01/30/15 9
Hello, String_algo!
#include <boost/algorithm/string.hpp>
using name...
copyright 2006 David Abrahams, Eric Niebler01/30/15 10
String_algo: split()
std::string str( "abc-*-ABC-*-aBc" );
std::vec...
01/30/15 11
Part 2: The Interesting Stuff
Structure d Te xt Manipulatio n
with Do m ain Spe cific
Lang uag e s
copyright 2006 David Abrahams, Eric Niebler01/30/15 12
Overview
 Declarative Programming and Domain-
Specific Languages.
...
copyright 2006 David Abrahams, Eric Niebler01/30/15 13
Grammar Refresher
Imperative Sentence: n.
Expressing a command or r...
copyright 2006 David Abrahams, Eric Niebler01/30/15 14
Computer Science Refresher
Imperative Programming: n.
A programming...
copyright 2006 David Abrahams, Eric Niebler01/30/15 15
Find/Print an Email Subject
std::string line;
while (std::getline(s...
copyright 2006 David Abrahams, Eric Niebler01/30/15 16
Find/Print an Email Subject
std::string line;
boost::regex pat( "^S...
copyright 2006 David Abrahams, Eric Niebler01/30/15 17
Which do you prefer?
Imperative: Declarative:
if (line.compare(...)...
copyright 2006 David Abrahams, Eric Niebler01/30/15 18
Riddle me this ...
If declarative is so much better
than imperative...
copyright 2006 David Abrahams, Eric Niebler01/30/15 19
Best of Both Worlds
 Domain-Specific Embedded Languages
A declara...
copyright 2006 David Abrahams, Eric Niebler01/30/15 20
Boost.Regex in Depth
 A powerful DSEL for text manipulation
 Acce...
copyright 2006 David Abrahams, Eric Niebler01/30/15 21
Dynamic DSEL in C++
 Embedded statements in strings
 Parsed at ru...
copyright 2006 David Abrahams, Eric Niebler01/30/15 22
The Regex Language
Syntax Meaning
^ Beginning-of-line assertion
$ E...
copyright 2006 David Abrahams, Eric Niebler01/30/15 24
Algorithm: regex_match
 Checks if a pattern matches the whole
inpu...
copyright 2006 David Abrahams, Eric Niebler01/30/15 25
Algorithm: regex_search
 Scans input to find a match
 Example: sc...
copyright 2006 David Abrahams, Eric Niebler01/30/15 26
Algorithm: regex_replace
 Replaces occurrences of a pattern
 Exam...
copyright 2006 David Abrahams, Eric Niebler01/30/15 27
Iterator: regex_iterator
 Iterates through all occurrences of a
pa...
copyright 2006 David Abrahams, Eric Niebler01/30/15 28
Iterator: regex_token_iterator
 Tokenizes input according to patte...
copyright 2006 David Abrahams, Eric Niebler01/30/15 29
Regex Challenge!
 Write a regex to match balanced, nested
braces, ...
copyright 2006 David Abrahams, Eric Niebler01/30/15 30
It's funny ... laugh.
“Some people, when confronted
with a problem,...
copyright 2006 David Abrahams, Eric Niebler01/30/15 31
Introducing Boost.Spirit
 Parser Generator
similar in purpose to ...
copyright 2006 David Abrahams, Eric Niebler01/30/15 32
Static DSEL in C++
 Embedded statements are C++
expressions
 Pars...
copyright 2006 David Abrahams, Eric Niebler01/30/15 33
Infix Calculator Grammar
 In Extended Backus-Naur Form
group ::= '...
copyright 2006 David Abrahams, Eric Niebler01/30/15 34
group ::= '(' expr ')'
fact ::= integer | group;
term ::= fact (('*...
copyright 2006 David Abrahams, Eric Niebler01/30/15 36
Spirit Parser Primitives
Syntax Meaning
ch_p('X') Match literal cha...
copyright 2006 David Abrahams, Eric Niebler01/30/15 37
Spirit Parser Operations
Syntax Meaning
x >> y Match x followed by ...
copyright 2006 David Abrahams, Eric Niebler01/30/15 40
Algorithm: spirit::parse
#include <boost/spirit.hpp>
using namespac...
copyright 2006 David Abrahams, Eric Niebler01/30/15 45
Semantic Actions
 Action to take when part of your grammar
succeed...
copyright 2006 David Abrahams, Eric Niebler01/30/15 46
Semantic Actions
 A few parsers process input first
void write(int...
copyright 2006 David Abrahams, Eric Niebler01/30/15 50
Should I use Regex or Spirit?
Regex Spirit
Ad-hoc pattern matching,...
copyright 2006 David Abrahams, Eric Niebler01/30/15 51
A Peek at Xpressive
 A regex library in the Spirit of Boost.Regex
...
copyright 2006 David Abrahams, Eric Niebler01/30/15 52
Xpressive: A Mixed-Mode DSEL
 Mix-n-match static and dynamic regex...
copyright 2006 David Abrahams, Eric Niebler01/30/15 53
Sizing It All Up
Regex Spirit Xpr
Ad-hoc pattern matching, regular ...
01/30/15 54
Appendix: Boost and
Unicode
Future Directions
copyright 2006 David Abrahams, Eric Niebler01/30/15 55
Wouldn't it be nice ...
Hmm ... where,
oh where, is
Boost.Unicode?
copyright 2006 David Abrahams, Eric Niebler01/30/15 56
UTF-8 Conversion Facet
 Converts UTF-8 input to UCS-4
 For use wi...
copyright 2006 David Abrahams, Eric Niebler01/30/15 57
UTF-8 Conversion Facet
#define BOOST_UTF8_BEGIN_NAMESPACE
#define B...
copyright 2006 David Abrahams, Eric Niebler01/30/15 58
Thanks!
 Boost: http://boost.org
 BoostCon: http://www.boostcon.o...
Upcoming SlideShare
Loading in...5
×

Text_Processing_With.. - The Iterator Blues

1,122

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,122
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • No, you can&amp;apos;t. Yes, I can!
    Irving Berlin, Annie Get Your Gun.
  • Remember program flow diagrams? Input -&amp;gt; Magic Happens -&amp;gt; Output.
    Input and Output are often text (text/config files, html/xml, scripts, log files, etc.)
    Std support for text manipulation poor.
    Boost fills gaps in Std and adds more powerful utilities.
  • atoi: If the input cannot be converted to a value of that type, the return value is 0. The return value is undefined in case of overflow.
  • Actually, anything can be converted to anything this way as long as both types are streamable
  • *_copy() variants create copies
    i* variants are case-insensitive (uses global locale)
  • So fundamental, it&amp;apos;s a bit embarrassing to even talk about it!
  • In grammar, an imperative sentence gives an instruction. First, do this, then do that.
    A declarative sentence makes a simple statement – an observation about the way things are.
    (Interrogative sentence? Exclamatory sentence?)
  • “Imperative programs make the algorithm explicit and leave the goal implicit, while declarative programs make the goal explicit and leave the algorithm implicit.” – Wikipedia, declarative programming.
    (Interrogative Programming? Exclamatory Programming? a = b!!! )
    Question: What is an example of a language that supports declarative programming?
    Answer: Prolog. HTML/XML. SQL.
    Question: C++ supports which programming paradigm?
    Answer #1: Imperative. C,C++,Java,C# are all imperative programming languages.
    Answer #2: Both! C++ is a multi-paradigm language; it has the power to support declarative programming through powerful libraries.
  • Style is imperative. First do this, then do that.
    Describes algorithm, goal is implicit.
    Code is brittle and hard to maintain.
  • Code is simpler.
    Clearly states programmer intention.
    Describes goal (find an email subject), not algorithm. Forest/trees.
  • Answer: Imperative languages are more general purpose.
    Since you cannot specify the algorithm with declarative programming, it’s best suited to situations where the algorithm can only take a limited set of forms.
    That’s typically true for narrow domains; e.g. finite automata for pattern matching, recursive descent for parsing, etc.
    Conclusion: declarative programming is ideal for domain-specific languages, imperative is better for general-purpose programming languages.
  • domain-specific for high-level, declarative solutions in specific domains
    general purpose for everything else
    can host multiple DSLs within the same general-purpose language!
  • Q: Why “\\d” instead of “\d”?
    A: The first escape is consumed by the C++ compiler. The second by the regex parser.
    {n} is a special postfix quantifier that means to match the previous thing exactly n times.
  • Will only succeed when the pattern matches the whole input!
  • Regex_constants::icase makes this a case-insensitive pattern.
    “.*?” means match any character zero or more times, preferring the shorter matches.
  • Incrementing a regex_iterator moves to the next match
    Dereferencing a regex_iterator returns a reference to a match_results
  • Last parameter tells how to extract tokens from a match. “1” means to extract capture #1. {0,1} would extract whole match followed by first capture.
    Incrementing a regex_token_iterator moves to the next token, either within the current match or from the next match.
    Dereferencing a regex_token_iterator returns a reference to a string.
  • The limits of regular expressions
    They can only find regular patterns (no recursive patterns)
    Little support for contextual matching
    Can’t build parsers out of them!
    Jamie Zawinski: XEmacs author, original author of Netscape Navigator
  • &amp;quot;Full access to types and data&amp;quot; : This becomes important later when we build parsers with semantic actions
  • Demonstrates use of expression templates to implement a static DSEL in C++
    Note use of overloaded operators
    postfix becomes prefix, etc.
    rules are pre-declared. A grammar is made up of rules.
  • _p means &amp;quot;parser&amp;quot;
  • For operator overloading to kick in, at least one operand must be a user-defined type.
  • right-hand side can be arbitrarily complicated.
  • This parser works on the character level – it will not ignore spaces. So &amp;quot;1 + 3&amp;quot; will not match!
  • A phrase_scanner is really a skip parser.
  • A grammar IS-A parser, and can be used wherever a parser can.
  • Notice the expression is used in-place without first assigning to a rule&amp;lt;&amp;gt;
    Actions can be anything that can be invoked as Func(iter1, iter2)
  • The first closure-generated member can be used as a rule’s “return value”
  • arg1 is a placeholder like lambda::_1
    It receives the return value of whatever parser to which this action is attached
    Since these actions are attached to rules with closures, arg1 gets the current value of the rule&amp;apos;s closure.
  • As with a rule, the &amp;quot;return value&amp;quot; of a grammar with a closure is the closure context.
    var(n) is needed here to turn the double into a phoenix lambda expression.
  • A Boost.Regex work-alike. Interface closely follows that of TR1 regex (but not exactly)
  • Embed a dynamic regex in a static regex – fully interoperable.
    Unlike Spirit, xpressive regex object embed by value by default.
    There is a special by_ref() helper for embedding a regex by reference. Useful for creating recursive regexes.
  • Xpressive actually has semantic actions, but they&amp;apos;re undocumented for 1.0. They&amp;apos;ll become official in 2.0.
  • Caveat: not an official Boost library!
  • Text_Processing_With.. - The Iterator Blues

    1. 1. 01/30/15 1 Text Processing With Boost O r, "Anything yo u can do , Ican do be tte r"
    2. 2. copyright 2006 David Abrahams, Eric Niebler01/30/15 2 Talk Overview Goal: Become productive C++ string manipulators with the help of Boost. 1. The Simple Stuff  Boost.Lexical_cast  Boost.String_algo 2. The Interesting Stuff  Boost.Regex  Boost.Spirit  Boost.Xpressive
    3. 3. 01/30/15 3 Part 1: The Simple Stuff Utilities for Ad Hoc Text Manipulation
    4. 4. copyright 2006 David Abrahams, Eric Niebler01/30/15 4 A Legacy of Inadequacy  Python: >>> int('123') 123 >>> str(123) '123'  C++: int i = atoi("123"); char buff[10]; itoa(123, buff, 10); No error handling! No error handling! Complicated interface Not actually standard!
    5. 5. copyright 2006 David Abrahams, Eric Niebler01/30/15 5 Stringstream: A Better atoi() { std::stringstream sout; std::string str; sout << 123; sout >> str; // OK, str == "123" } { std::stringstream sout; int i; sout << "789"; sout >> i; // OK, i == 789 }
    6. 6. copyright 2006 David Abrahams, Eric Niebler01/30/15 6 Boost.Lexical_cast // Approximate implementation ... template< typename Target, typename Source > Target lexical_cast(Source const & arg) { std::stringstream sout; Target result; if(!(sout << arg && sout >> result)) throw bad_lexical_cast( typeid(Source), typeid(Target)); return result; } Kevlin Henney
    7. 7. copyright 2006 David Abrahams, Eric Niebler01/30/15 7 Boost.Lexical_cast int i = lexical_cast<int>( "123" ); std::string str = lexical_cast<std::string>( 789 );  Ugly name  Sub-par performance  No i18n  Clean Interface  Error Reporting, Yay!  Extensible
    8. 8. copyright 2006 David Abrahams, Eric Niebler01/30/15 8 Boost.String_algo  Extension to std:: algorithms  Includes algorithms for: trimming case-conversions find/replace utilities ... and much more! Pavol Droba
    9. 9. copyright 2006 David Abrahams, Eric Niebler01/30/15 9 Hello, String_algo! #include <boost/algorithm/string.hpp> using namespace std; using namespace boost; string str1(" hello world! "); to_upper(str1); // str1 == " HELLO WORLD! " trim(str1); // str1 == "HELLO WORLD!" string str2 = to_lower_copy( ireplace_first_copy( str1, "hello", "goodbye")); // str2 == "goodbye world!" Mutate String In-Place Create a New String Composable Algorithms!
    10. 10. copyright 2006 David Abrahams, Eric Niebler01/30/15 10 String_algo: split() std::string str( "abc-*-ABC-*-aBc" ); std::vector< std::string > tokens; split( tokens, str, is_any_of("-*") ); // OK, tokens == { "abc", "ABC", "aBc" } Other Classifications: is_space(), is_upper(), etc. is_from_range('a','z'), is_alnum() || is_punct()
    11. 11. 01/30/15 11 Part 2: The Interesting Stuff Structure d Te xt Manipulatio n with Do m ain Spe cific Lang uag e s
    12. 12. copyright 2006 David Abrahams, Eric Niebler01/30/15 12 Overview  Declarative Programming and Domain- Specific Languages.  Manipulating Text Dynamically Boost.Regex  Generating Parsers Statically Boost.Spirit  Mixed-Mode Pattern Matching Boost.Xpressive
    13. 13. copyright 2006 David Abrahams, Eric Niebler01/30/15 13 Grammar Refresher Imperative Sentence: n. Expressing a command or request. E.g., “Set the TV on fire.” Declarative Sentence: n. Serving to declare or state. E.g., “The TV is on fire.”
    14. 14. copyright 2006 David Abrahams, Eric Niebler01/30/15 14 Computer Science Refresher Imperative Programming: n. A programming paradigm that describes computation in terms of a program state and statements that change the program state. Declarative Programming: n. A programming paradigm that describes computation in terms of what to compute, not how to compute it.
    15. 15. copyright 2006 David Abrahams, Eric Niebler01/30/15 15 Find/Print an Email Subject std::string line; while (std::getline(std::cin, line)) { if (line.compare(0, 9, "Subject: ") == 0) { std::size_t offset = 9; if (line.compare(offset, 4, "Re: ")) offset += 4; std::cout << line.substr(offset); } }
    16. 16. copyright 2006 David Abrahams, Eric Niebler01/30/15 16 Find/Print an Email Subject std::string line; boost::regex pat( "^Subject: (Re: )?(.*)" ); boost::smatch what; while (std::getline(std::cin, line)) { if (boost::regex_match(line, what, pat)) std::cout << what[2]; }
    17. 17. copyright 2006 David Abrahams, Eric Niebler01/30/15 17 Which do you prefer? Imperative: Declarative: if (line.compare(...) == 0) { std::size_t offset = ...; if (line.compare(...) == 0) offset += ...; } "^Subject: (Re: )?(.*)"  Describes algorithm  Verbose  Hard to maintain  Describes goal  Concise  Easy to maintain
    18. 18. copyright 2006 David Abrahams, Eric Niebler01/30/15 18 Riddle me this ... If declarative is so much better than imperative, why are most popular programming languages imperative?
    19. 19. copyright 2006 David Abrahams, Eric Niebler01/30/15 19 Best of Both Worlds  Domain-Specific Embedded Languages A declarative DSL hosted in an imperative general-purpose language.  Examples: Ruby on Rails in Ruby JUnit Test Framework in Java Regex in perl, C/C++, .NET, etc.
    20. 20. copyright 2006 David Abrahams, Eric Niebler01/30/15 20 Boost.Regex in Depth  A powerful DSEL for text manipulation  Accepted into std::tr1 Coming in C++0x!  Useful constructs for: matching searching replacing tokenizing John Maddock
    21. 21. copyright 2006 David Abrahams, Eric Niebler01/30/15 21 Dynamic DSEL in C++  Embedded statements in strings  Parsed at runtime  Executed by an interpreter  Advantages Free-form syntax New statements can be accepted at runtime  Examples regex: "^Subject: (Re: )?(.*)" SQL: "SELECT * FROM Employees ORDER BY Salary"
    22. 22. copyright 2006 David Abrahams, Eric Niebler01/30/15 22 The Regex Language Syntax Meaning ^ Beginning-of-line assertion $ End-of-line assertion . Match any single character [abc] Match any of ‘a’, ‘b’, or ‘c’ [^0-9] Match any character not in the range ‘0’ through ‘9’ w, d, s Match a word, digit, or space character *, +, ? Zero or more, one or more, or optional (postfix, greedy) (stuff) Numbered capture: remember what stuff matches 1 Match what the 1st numbered capture matched
    23. 23. copyright 2006 David Abrahams, Eric Niebler01/30/15 24 Algorithm: regex_match  Checks if a pattern matches the whole input.  Example: Match a Social Security Number std::string line; boost::regex ssn("d{3}-dd-d{4}"); while (std::getline(std::cin, line)) { if (boost::regex_match(line, ssn)) break; std::cout << "Invalid SSN. Try again." << std::endl; }
    24. 24. copyright 2006 David Abrahams, Eric Niebler01/30/15 25 Algorithm: regex_search  Scans input to find a match  Example: scan HTML for an email address std::string html = …; regex mailto("<a href="mailto:(.*?)">", regex_constants::icase); smatch what; if (boost::regex_search(html, what, mailto)) { std::cout << "Email address to spam: " << what[1]; }
    25. 25. copyright 2006 David Abrahams, Eric Niebler01/30/15 26 Algorithm: regex_replace  Replaces occurrences of a pattern  Example: Simple URL escaping std::string url = "http://foo.net/this has spaces"; std::string format = "%20"; boost::regex pat(" "); // This changes url to "http://foo.net/this%20has%20spaces" url = boost::regex_replace(url, pat, format);
    26. 26. copyright 2006 David Abrahams, Eric Niebler01/30/15 27 Iterator: regex_iterator  Iterates through all occurrences of a pattern  Example: scan HTML for email addresses std::string html = …; regex mailto("<a href="mailto:(.*?)">", regex_constants::icase); sregex_iterator begin(html.begin(), html.end(), mailto), end; for (; begin != end; ++begin) { smatch const & what = *begin; std::cout << "Email address to spam: " << what[1] << "n"; }
    27. 27. copyright 2006 David Abrahams, Eric Niebler01/30/15 28 Iterator: regex_token_iterator  Tokenizes input according to pattern  Example: scan HTML for email addresses std::string html = …; regex mailto("<a href="mailto:(.*?)">", regex_constants::icase); sregex_token_iterator begin(html.begin(), html.end(), mailto, 1), end; std::ostream_iterator<std::string> out(std::cout, "n"); std::copy(begin, end, out); // write all email addresses to std::cout std::string html = …; regex mailto("<a href="mailto:(.*?)">", regex_constants::icase); sregex_token_iterator begin(html.begin(), html.end(), mailto, 1), end; using namespace boost::lambda; std::for_each(begin, end, std::cout << _1 << 'n');
    28. 28. copyright 2006 David Abrahams, Eric Niebler01/30/15 29 Regex Challenge!  Write a regex to match balanced, nested braces, e.g. "{ foo { bar } baz }" regex braces("{[^{}]*}"); regex braces("{[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]* ({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{ }]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({ [^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}] *({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^ {}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*( {[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{} ]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[ ^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}] regex braces("{[^{}]*({[^{}]*}[^{}]*)*}"); regex braces("{[^{}]*({[^{}]*({[^{}]*}[^{}]*)*}[^{}]*)*}"); Not quite. Better, but no. Not there, yet. Whoops!
    29. 29. copyright 2006 David Abrahams, Eric Niebler01/30/15 30 It's funny ... laugh. “Some people, when confronted with a problem, think, ‘I know, I’ll use regular expressions.’ Now they have two problems.” --Jamie Zawinski, in comp.lang.emacs
    30. 30. copyright 2006 David Abrahams, Eric Niebler01/30/15 31 Introducing Boost.Spirit  Parser Generator similar in purpose to lex / YACC  DSEL for declaring grammars grammars can be recursive DSEL approximates Backus-Naur Form  Statically embedded language Domain-specific statements are composed from C++ expressions. Joel de Guzman
    31. 31. copyright 2006 David Abrahams, Eric Niebler01/30/15 32 Static DSEL in C++  Embedded statements are C++ expressions  Parsed at compile time  Generates machine-code, executed directly  Advantages: Syntax-checked by the compiler Better performance Full access to types and data in your program
    32. 32. copyright 2006 David Abrahams, Eric Niebler01/30/15 33 Infix Calculator Grammar  In Extended Backus-Naur Form group ::= '(' expr ')' fact ::= integer | group; term ::= fact (('*' fact) | ('/' fact))* expr ::= term (('+' term) | ('-' term))*
    33. 33. copyright 2006 David Abrahams, Eric Niebler01/30/15 34 group ::= '(' expr ')' fact ::= integer | group; term ::= fact (('*' fact) | ('/' fact))* expr ::= term (('+' term) | ('-' term))* Infix Calculator Grammar  In Boost.Spirit spirit::rule<> group, fact, term, expr; group = '(' >> expr >> ')'; fact = spirit::int_p | group; term = fact >> *(('*' >> fact) | ('/' >> fact)); expr = term >> *(('+' >> term) | ('-' >> term));
    34. 34. copyright 2006 David Abrahams, Eric Niebler01/30/15 36 Spirit Parser Primitives Syntax Meaning ch_p('X') Match literal character 'X' range_p('a','z') Match characters in the range 'a' through 'z' str_p("hello") Match the literal string "hello" chseq_p("ABCD" ) Like str_p, but ignores whitespace anychar_p Matches any single character chset_p("1234") Matches any of '1', '2', '3', or '4' eol_p Matches end-of-line (CR/LF and combinations) end_p Matches end of input nothing_p Matches nothing, always fails
    35. 35. copyright 2006 David Abrahams, Eric Niebler01/30/15 37 Spirit Parser Operations Syntax Meaning x >> y Match x followed by y x | y Match x or y ~x Match any char not x (x is a single-char parser) x - y Difference: match x but not y *x Match x zero or more times +x Match x one or more times !x x is optional x[ f ] Semantic action: invoke f when x matches
    36. 36. copyright 2006 David Abrahams, Eric Niebler01/30/15 40 Algorithm: spirit::parse #include <boost/spirit.hpp> using namespace boost; int main() { spirit::rule<> group, fact, term, expr; group = '(' >> expr >> ')'; fact = spirit::int_p | group; term = fact >> *(('*' >> fact) | ('/' >> fact)); expr = term >> *(('+' >> term) | ('-' >> term)); assert( spirit::parse("2*(3+4)", expr).full ); assert( ! spirit::parse("2*(3+4", expr).full ); } Parse strings as an expr (“start symbol” = expr). spirit::parse returns a spirit::parse_info<> struct.
    37. 37. copyright 2006 David Abrahams, Eric Niebler01/30/15 45 Semantic Actions  Action to take when part of your grammar succeeds void write(char const *begin, char const *end) { std::cout.write(begin, end – begin); } // This prints "hi" to std::cout spirit::parse("{hi}", '{' >> (*alpha_p)[&write] >> '}'); Match alphabetic characters, call write() with range of characters that matched.
    38. 38. copyright 2006 David Abrahams, Eric Niebler01/30/15 46 Semantic Actions  A few parsers process input first void write(int d) { std::cout << d; } // This prints "42" to std::cout spirit::parse("(42)", '(' >> int_p[&write] >> ')'); int_p "returns" an int. using namespace lambda; // This prints "42" to std::cout spirit::parse("(42)", '(' >> int_p[cout << _1] >> ')'); We can use a Boost.Lambda expression as a semantic action!
    39. 39. copyright 2006 David Abrahams, Eric Niebler01/30/15 50 Should I use Regex or Spirit? Regex Spirit Ad-hoc pattern matching, regular languages   Structured parsing, context-free grammars   Manipulating text   Semantic actions, manipulating program state   Dynamic; new statements at runtime   Static; no new statements at runtime   Exhaustive backtracking semantics  
    40. 40. copyright 2006 David Abrahams, Eric Niebler01/30/15 51 A Peek at Xpressive  A regex library in the Spirit of Boost.Regex (pun intended)  Both a static and a dynamic DSEL! Dynamic syntax is similar to Boost.Regex Static syntax is similar to Boost.Spirit using namespace boost::xpressive; sregex dyn = sregex::compile( "Subject: (Re: )?(.*)" ); sregex sta = "Subject: " >> !(s1= "Re: ") >> (s2= *_); dyn is a dynamic regex sta is a static regex
    41. 41. copyright 2006 David Abrahams, Eric Niebler01/30/15 52 Xpressive: A Mixed-Mode DSEL  Mix-n-match static and dynamic regex // Get a pattern from the user at runtime: std::string str = get_pattern(); sregex pat = sregex::compile( str ); // Wrap the regex in begin- and end-word assertions: pat = bow >> pat >> eow; sregex braces, not_brace; not_brace = ~(set= '{', '}'); braces = '{' >> *(+not_brace | by_ref(braces)) >> '}';  Embed regexes by reference, too
    42. 42. copyright 2006 David Abrahams, Eric Niebler01/30/15 53 Sizing It All Up Regex Spirit Xpr Ad-hoc pattern matching, regular languages    Structured parsing, context-free grammars    Manipulating text    Semantic actions, manipulating program state    Dynamic; new statements at runtime    Static; no new statements at runtime    Exhaustive backtracking semantics    Blessed by TR1   
    43. 43. 01/30/15 54 Appendix: Boost and Unicode Future Directions
    44. 44. copyright 2006 David Abrahams, Eric Niebler01/30/15 55 Wouldn't it be nice ... Hmm ... where, oh where, is Boost.Unicode?
    45. 45. copyright 2006 David Abrahams, Eric Niebler01/30/15 56 UTF-8 Conversion Facet  Converts UTF-8 input to UCS-4  For use with std::locale  Implementation detail!  But useful nonetheless
    46. 46. copyright 2006 David Abrahams, Eric Niebler01/30/15 57 UTF-8 Conversion Facet #define BOOST_UTF8_BEGIN_NAMESPACE #define BOOST_UTF8_END_NAMESPACE #define BOOST_UTF8_DECL #include <fstream> #include <boost/detail/utf8_codecvt_facet.hpp> #include <libs/detail/utf8_codecvt_facet.cpp> int main() { std::wstring str; std::wifstream bad("C:utf8.txt"); bad >> str; assert( str == L"äöü" ); // OOPS! :-( std::wifstream good("C:utf8.txt"); good.imbue(std::locale(std::locale(), new utf8_codecvt_facet)); good >> str; assert( str == L"äöü" ); // SUCCESS!! :-) }
    47. 47. copyright 2006 David Abrahams, Eric Niebler01/30/15 58 Thanks!  Boost: http://boost.org  BoostCon: http://www.boostcon.org Questions?
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×