Your SlideShare is downloading. ×
Text_Processing_With.. - The Iterator Blues
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Text_Processing_With.. - The Iterator Blues

1,083

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,083
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • No, you can't. Yes, I can!
    Irving Berlin, Annie Get Your Gun.
  • Remember program flow diagrams? Input -> Magic Happens -> Output.
    Input and Output are often text (text/config files, html/xml, scripts, log files, etc.)
    Std support for text manipulation poor.
    Boost fills gaps in Std and adds more powerful utilities.
  • atoi: If the input cannot be converted to a value of that type, the return value is 0. The return value is undefined in case of overflow.
  • Actually, anything can be converted to anything this way as long as both types are streamable
  • *_copy() variants create copies
    i* variants are case-insensitive (uses global locale)
  • So fundamental, it's a bit embarrassing to even talk about it!
  • In grammar, an imperative sentence gives an instruction. First, do this, then do that.
    A declarative sentence makes a simple statement – an observation about the way things are.
    (Interrogative sentence? Exclamatory sentence?)
  • “Imperative programs make the algorithm explicit and leave the goal implicit, while declarative programs make the goal explicit and leave the algorithm implicit.” – Wikipedia, declarative programming.
    (Interrogative Programming? Exclamatory Programming? a = b!!! )
    Question: What is an example of a language that supports declarative programming?
    Answer: Prolog. HTML/XML. SQL.
    Question: C++ supports which programming paradigm?
    Answer #1: Imperative. C,C++,Java,C# are all imperative programming languages.
    Answer #2: Both! C++ is a multi-paradigm language; it has the power to support declarative programming through powerful libraries.
  • Style is imperative. First do this, then do that.
    Describes algorithm, goal is implicit.
    Code is brittle and hard to maintain.
  • Code is simpler.
    Clearly states programmer intention.
    Describes goal (find an email subject), not algorithm. Forest/trees.
  • Answer: Imperative languages are more general purpose.
    Since you cannot specify the algorithm with declarative programming, it’s best suited to situations where the algorithm can only take a limited set of forms.
    That’s typically true for narrow domains; e.g. finite automata for pattern matching, recursive descent for parsing, etc.
    Conclusion: declarative programming is ideal for domain-specific languages, imperative is better for general-purpose programming languages.
  • domain-specific for high-level, declarative solutions in specific domains
    general purpose for everything else
    can host multiple DSLs within the same general-purpose language!
  • Q: Why “\\d” instead of “\d”?
    A: The first escape is consumed by the C++ compiler. The second by the regex parser.
    {n} is a special postfix quantifier that means to match the previous thing exactly n times.
  • Will only succeed when the pattern matches the whole input!
  • Regex_constants::icase makes this a case-insensitive pattern.
    “.*?” means match any character zero or more times, preferring the shorter matches.
  • Incrementing a regex_iterator moves to the next match
    Dereferencing a regex_iterator returns a reference to a match_results
  • Last parameter tells how to extract tokens from a match. “1” means to extract capture #1. {0,1} would extract whole match followed by first capture.
    Incrementing a regex_token_iterator moves to the next token, either within the current match or from the next match.
    Dereferencing a regex_token_iterator returns a reference to a string.
  • The limits of regular expressions
    They can only find regular patterns (no recursive patterns)
    Little support for contextual matching
    Can’t build parsers out of them!
    Jamie Zawinski: XEmacs author, original author of Netscape Navigator
  • "Full access to types and data" : This becomes important later when we build parsers with semantic actions
  • Demonstrates use of expression templates to implement a static DSEL in C++
    Note use of overloaded operators
    postfix becomes prefix, etc.
    rules are pre-declared. A grammar is made up of rules.
  • _p means "parser"
  • For operator overloading to kick in, at least one operand must be a user-defined type.
  • right-hand side can be arbitrarily complicated.
  • This parser works on the character level – it will not ignore spaces. So "1 + 3" will not match!
  • A phrase_scanner is really a skip parser.
  • A grammar IS-A parser, and can be used wherever a parser can.
  • Notice the expression is used in-place without first assigning to a rule<>
    Actions can be anything that can be invoked as Func(iter1, iter2)
  • The first closure-generated member can be used as a rule’s “return value”
  • arg1 is a placeholder like lambda::_1
    It receives the return value of whatever parser to which this action is attached
    Since these actions are attached to rules with closures, arg1 gets the current value of the rule's closure.
  • As with a rule, the "return value" of a grammar with a closure is the closure context.
    var(n) is needed here to turn the double into a phoenix lambda expression.
  • A Boost.Regex work-alike. Interface closely follows that of TR1 regex (but not exactly)
  • Embed a dynamic regex in a static regex – fully interoperable.
    Unlike Spirit, xpressive regex object embed by value by default.
    There is a special by_ref() helper for embedding a regex by reference. Useful for creating recursive regexes.
  • Xpressive actually has semantic actions, but they're undocumented for 1.0. They'll become official in 2.0.
  • Caveat: not an official Boost library!
  • Transcript

    • 1. 01/30/15 1 Text Processing With Boost O r, "Anything yo u can do , Ican do be tte r"
    • 2. copyright 2006 David Abrahams, Eric Niebler01/30/15 2 Talk Overview Goal: Become productive C++ string manipulators with the help of Boost. 1. The Simple Stuff  Boost.Lexical_cast  Boost.String_algo 2. The Interesting Stuff  Boost.Regex  Boost.Spirit  Boost.Xpressive
    • 3. 01/30/15 3 Part 1: The Simple Stuff Utilities for Ad Hoc Text Manipulation
    • 4. copyright 2006 David Abrahams, Eric Niebler01/30/15 4 A Legacy of Inadequacy  Python: >>> int('123') 123 >>> str(123) '123'  C++: int i = atoi("123"); char buff[10]; itoa(123, buff, 10); No error handling! No error handling! Complicated interface Not actually standard!
    • 5. copyright 2006 David Abrahams, Eric Niebler01/30/15 5 Stringstream: A Better atoi() { std::stringstream sout; std::string str; sout << 123; sout >> str; // OK, str == "123" } { std::stringstream sout; int i; sout << "789"; sout >> i; // OK, i == 789 }
    • 6. copyright 2006 David Abrahams, Eric Niebler01/30/15 6 Boost.Lexical_cast // Approximate implementation ... template< typename Target, typename Source > Target lexical_cast(Source const & arg) { std::stringstream sout; Target result; if(!(sout << arg && sout >> result)) throw bad_lexical_cast( typeid(Source), typeid(Target)); return result; } Kevlin Henney
    • 7. copyright 2006 David Abrahams, Eric Niebler01/30/15 7 Boost.Lexical_cast int i = lexical_cast<int>( "123" ); std::string str = lexical_cast<std::string>( 789 );  Ugly name  Sub-par performance  No i18n  Clean Interface  Error Reporting, Yay!  Extensible
    • 8. copyright 2006 David Abrahams, Eric Niebler01/30/15 8 Boost.String_algo  Extension to std:: algorithms  Includes algorithms for: trimming case-conversions find/replace utilities ... and much more! Pavol Droba
    • 9. copyright 2006 David Abrahams, Eric Niebler01/30/15 9 Hello, String_algo! #include <boost/algorithm/string.hpp> using namespace std; using namespace boost; string str1(" hello world! "); to_upper(str1); // str1 == " HELLO WORLD! " trim(str1); // str1 == "HELLO WORLD!" string str2 = to_lower_copy( ireplace_first_copy( str1, "hello", "goodbye")); // str2 == "goodbye world!" Mutate String In-Place Create a New String Composable Algorithms!
    • 10. copyright 2006 David Abrahams, Eric Niebler01/30/15 10 String_algo: split() std::string str( "abc-*-ABC-*-aBc" ); std::vector< std::string > tokens; split( tokens, str, is_any_of("-*") ); // OK, tokens == { "abc", "ABC", "aBc" } Other Classifications: is_space(), is_upper(), etc. is_from_range('a','z'), is_alnum() || is_punct()
    • 11. 01/30/15 11 Part 2: The Interesting Stuff Structure d Te xt Manipulatio n with Do m ain Spe cific Lang uag e s
    • 12. copyright 2006 David Abrahams, Eric Niebler01/30/15 12 Overview  Declarative Programming and Domain- Specific Languages.  Manipulating Text Dynamically Boost.Regex  Generating Parsers Statically Boost.Spirit  Mixed-Mode Pattern Matching Boost.Xpressive
    • 13. copyright 2006 David Abrahams, Eric Niebler01/30/15 13 Grammar Refresher Imperative Sentence: n. Expressing a command or request. E.g., “Set the TV on fire.” Declarative Sentence: n. Serving to declare or state. E.g., “The TV is on fire.”
    • 14. copyright 2006 David Abrahams, Eric Niebler01/30/15 14 Computer Science Refresher Imperative Programming: n. A programming paradigm that describes computation in terms of a program state and statements that change the program state. Declarative Programming: n. A programming paradigm that describes computation in terms of what to compute, not how to compute it.
    • 15. copyright 2006 David Abrahams, Eric Niebler01/30/15 15 Find/Print an Email Subject std::string line; while (std::getline(std::cin, line)) { if (line.compare(0, 9, "Subject: ") == 0) { std::size_t offset = 9; if (line.compare(offset, 4, "Re: ")) offset += 4; std::cout << line.substr(offset); } }
    • 16. copyright 2006 David Abrahams, Eric Niebler01/30/15 16 Find/Print an Email Subject std::string line; boost::regex pat( "^Subject: (Re: )?(.*)" ); boost::smatch what; while (std::getline(std::cin, line)) { if (boost::regex_match(line, what, pat)) std::cout << what[2]; }
    • 17. copyright 2006 David Abrahams, Eric Niebler01/30/15 17 Which do you prefer? Imperative: Declarative: if (line.compare(...) == 0) { std::size_t offset = ...; if (line.compare(...) == 0) offset += ...; } "^Subject: (Re: )?(.*)"  Describes algorithm  Verbose  Hard to maintain  Describes goal  Concise  Easy to maintain
    • 18. copyright 2006 David Abrahams, Eric Niebler01/30/15 18 Riddle me this ... If declarative is so much better than imperative, why are most popular programming languages imperative?
    • 19. copyright 2006 David Abrahams, Eric Niebler01/30/15 19 Best of Both Worlds  Domain-Specific Embedded Languages A declarative DSL hosted in an imperative general-purpose language.  Examples: Ruby on Rails in Ruby JUnit Test Framework in Java Regex in perl, C/C++, .NET, etc.
    • 20. copyright 2006 David Abrahams, Eric Niebler01/30/15 20 Boost.Regex in Depth  A powerful DSEL for text manipulation  Accepted into std::tr1 Coming in C++0x!  Useful constructs for: matching searching replacing tokenizing John Maddock
    • 21. copyright 2006 David Abrahams, Eric Niebler01/30/15 21 Dynamic DSEL in C++  Embedded statements in strings  Parsed at runtime  Executed by an interpreter  Advantages Free-form syntax New statements can be accepted at runtime  Examples regex: "^Subject: (Re: )?(.*)" SQL: "SELECT * FROM Employees ORDER BY Salary"
    • 22. copyright 2006 David Abrahams, Eric Niebler01/30/15 22 The Regex Language Syntax Meaning ^ Beginning-of-line assertion $ End-of-line assertion . Match any single character [abc] Match any of ‘a’, ‘b’, or ‘c’ [^0-9] Match any character not in the range ‘0’ through ‘9’ w, d, s Match a word, digit, or space character *, +, ? Zero or more, one or more, or optional (postfix, greedy) (stuff) Numbered capture: remember what stuff matches 1 Match what the 1st numbered capture matched
    • 23. copyright 2006 David Abrahams, Eric Niebler01/30/15 24 Algorithm: regex_match  Checks if a pattern matches the whole input.  Example: Match a Social Security Number std::string line; boost::regex ssn("d{3}-dd-d{4}"); while (std::getline(std::cin, line)) { if (boost::regex_match(line, ssn)) break; std::cout << "Invalid SSN. Try again." << std::endl; }
    • 24. copyright 2006 David Abrahams, Eric Niebler01/30/15 25 Algorithm: regex_search  Scans input to find a match  Example: scan HTML for an email address std::string html = …; regex mailto("<a href="mailto:(.*?)">", regex_constants::icase); smatch what; if (boost::regex_search(html, what, mailto)) { std::cout << "Email address to spam: " << what[1]; }
    • 25. copyright 2006 David Abrahams, Eric Niebler01/30/15 26 Algorithm: regex_replace  Replaces occurrences of a pattern  Example: Simple URL escaping std::string url = "http://foo.net/this has spaces"; std::string format = "%20"; boost::regex pat(" "); // This changes url to "http://foo.net/this%20has%20spaces" url = boost::regex_replace(url, pat, format);
    • 26. copyright 2006 David Abrahams, Eric Niebler01/30/15 27 Iterator: regex_iterator  Iterates through all occurrences of a pattern  Example: scan HTML for email addresses std::string html = …; regex mailto("<a href="mailto:(.*?)">", regex_constants::icase); sregex_iterator begin(html.begin(), html.end(), mailto), end; for (; begin != end; ++begin) { smatch const & what = *begin; std::cout << "Email address to spam: " << what[1] << "n"; }
    • 27. copyright 2006 David Abrahams, Eric Niebler01/30/15 28 Iterator: regex_token_iterator  Tokenizes input according to pattern  Example: scan HTML for email addresses std::string html = …; regex mailto("<a href="mailto:(.*?)">", regex_constants::icase); sregex_token_iterator begin(html.begin(), html.end(), mailto, 1), end; std::ostream_iterator<std::string> out(std::cout, "n"); std::copy(begin, end, out); // write all email addresses to std::cout std::string html = …; regex mailto("<a href="mailto:(.*?)">", regex_constants::icase); sregex_token_iterator begin(html.begin(), html.end(), mailto, 1), end; using namespace boost::lambda; std::for_each(begin, end, std::cout << _1 << 'n');
    • 28. copyright 2006 David Abrahams, Eric Niebler01/30/15 29 Regex Challenge!  Write a regex to match balanced, nested braces, e.g. "{ foo { bar } baz }" regex braces("{[^{}]*}"); regex braces("{[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]* ({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{ }]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({ [^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}] *({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^ {}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*( {[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{} ]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[ ^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}]*({[^{}] regex braces("{[^{}]*({[^{}]*}[^{}]*)*}"); regex braces("{[^{}]*({[^{}]*({[^{}]*}[^{}]*)*}[^{}]*)*}"); Not quite. Better, but no. Not there, yet. Whoops!
    • 29. copyright 2006 David Abrahams, Eric Niebler01/30/15 30 It's funny ... laugh. “Some people, when confronted with a problem, think, ‘I know, I’ll use regular expressions.’ Now they have two problems.” --Jamie Zawinski, in comp.lang.emacs
    • 30. copyright 2006 David Abrahams, Eric Niebler01/30/15 31 Introducing Boost.Spirit  Parser Generator similar in purpose to lex / YACC  DSEL for declaring grammars grammars can be recursive DSEL approximates Backus-Naur Form  Statically embedded language Domain-specific statements are composed from C++ expressions. Joel de Guzman
    • 31. copyright 2006 David Abrahams, Eric Niebler01/30/15 32 Static DSEL in C++  Embedded statements are C++ expressions  Parsed at compile time  Generates machine-code, executed directly  Advantages: Syntax-checked by the compiler Better performance Full access to types and data in your program
    • 32. copyright 2006 David Abrahams, Eric Niebler01/30/15 33 Infix Calculator Grammar  In Extended Backus-Naur Form group ::= '(' expr ')' fact ::= integer | group; term ::= fact (('*' fact) | ('/' fact))* expr ::= term (('+' term) | ('-' term))*
    • 33. copyright 2006 David Abrahams, Eric Niebler01/30/15 34 group ::= '(' expr ')' fact ::= integer | group; term ::= fact (('*' fact) | ('/' fact))* expr ::= term (('+' term) | ('-' term))* Infix Calculator Grammar  In Boost.Spirit spirit::rule<> group, fact, term, expr; group = '(' >> expr >> ')'; fact = spirit::int_p | group; term = fact >> *(('*' >> fact) | ('/' >> fact)); expr = term >> *(('+' >> term) | ('-' >> term));
    • 34. copyright 2006 David Abrahams, Eric Niebler01/30/15 36 Spirit Parser Primitives Syntax Meaning ch_p('X') Match literal character 'X' range_p('a','z') Match characters in the range 'a' through 'z' str_p("hello") Match the literal string "hello" chseq_p("ABCD" ) Like str_p, but ignores whitespace anychar_p Matches any single character chset_p("1234") Matches any of '1', '2', '3', or '4' eol_p Matches end-of-line (CR/LF and combinations) end_p Matches end of input nothing_p Matches nothing, always fails
    • 35. copyright 2006 David Abrahams, Eric Niebler01/30/15 37 Spirit Parser Operations Syntax Meaning x >> y Match x followed by y x | y Match x or y ~x Match any char not x (x is a single-char parser) x - y Difference: match x but not y *x Match x zero or more times +x Match x one or more times !x x is optional x[ f ] Semantic action: invoke f when x matches
    • 36. copyright 2006 David Abrahams, Eric Niebler01/30/15 40 Algorithm: spirit::parse #include <boost/spirit.hpp> using namespace boost; int main() { spirit::rule<> group, fact, term, expr; group = '(' >> expr >> ')'; fact = spirit::int_p | group; term = fact >> *(('*' >> fact) | ('/' >> fact)); expr = term >> *(('+' >> term) | ('-' >> term)); assert( spirit::parse("2*(3+4)", expr).full ); assert( ! spirit::parse("2*(3+4", expr).full ); } Parse strings as an expr (“start symbol” = expr). spirit::parse returns a spirit::parse_info<> struct.
    • 37. copyright 2006 David Abrahams, Eric Niebler01/30/15 45 Semantic Actions  Action to take when part of your grammar succeeds void write(char const *begin, char const *end) { std::cout.write(begin, end – begin); } // This prints "hi" to std::cout spirit::parse("{hi}", '{' >> (*alpha_p)[&write] >> '}'); Match alphabetic characters, call write() with range of characters that matched.
    • 38. copyright 2006 David Abrahams, Eric Niebler01/30/15 46 Semantic Actions  A few parsers process input first void write(int d) { std::cout << d; } // This prints "42" to std::cout spirit::parse("(42)", '(' >> int_p[&write] >> ')'); int_p "returns" an int. using namespace lambda; // This prints "42" to std::cout spirit::parse("(42)", '(' >> int_p[cout << _1] >> ')'); We can use a Boost.Lambda expression as a semantic action!
    • 39. copyright 2006 David Abrahams, Eric Niebler01/30/15 50 Should I use Regex or Spirit? Regex Spirit Ad-hoc pattern matching, regular languages   Structured parsing, context-free grammars   Manipulating text   Semantic actions, manipulating program state   Dynamic; new statements at runtime   Static; no new statements at runtime   Exhaustive backtracking semantics  
    • 40. copyright 2006 David Abrahams, Eric Niebler01/30/15 51 A Peek at Xpressive  A regex library in the Spirit of Boost.Regex (pun intended)  Both a static and a dynamic DSEL! Dynamic syntax is similar to Boost.Regex Static syntax is similar to Boost.Spirit using namespace boost::xpressive; sregex dyn = sregex::compile( "Subject: (Re: )?(.*)" ); sregex sta = "Subject: " >> !(s1= "Re: ") >> (s2= *_); dyn is a dynamic regex sta is a static regex
    • 41. copyright 2006 David Abrahams, Eric Niebler01/30/15 52 Xpressive: A Mixed-Mode DSEL  Mix-n-match static and dynamic regex // Get a pattern from the user at runtime: std::string str = get_pattern(); sregex pat = sregex::compile( str ); // Wrap the regex in begin- and end-word assertions: pat = bow >> pat >> eow; sregex braces, not_brace; not_brace = ~(set= '{', '}'); braces = '{' >> *(+not_brace | by_ref(braces)) >> '}';  Embed regexes by reference, too
    • 42. copyright 2006 David Abrahams, Eric Niebler01/30/15 53 Sizing It All Up Regex Spirit Xpr Ad-hoc pattern matching, regular languages    Structured parsing, context-free grammars    Manipulating text    Semantic actions, manipulating program state    Dynamic; new statements at runtime    Static; no new statements at runtime    Exhaustive backtracking semantics    Blessed by TR1   
    • 43. 01/30/15 54 Appendix: Boost and Unicode Future Directions
    • 44. copyright 2006 David Abrahams, Eric Niebler01/30/15 55 Wouldn't it be nice ... Hmm ... where, oh where, is Boost.Unicode?
    • 45. copyright 2006 David Abrahams, Eric Niebler01/30/15 56 UTF-8 Conversion Facet  Converts UTF-8 input to UCS-4  For use with std::locale  Implementation detail!  But useful nonetheless
    • 46. copyright 2006 David Abrahams, Eric Niebler01/30/15 57 UTF-8 Conversion Facet #define BOOST_UTF8_BEGIN_NAMESPACE #define BOOST_UTF8_END_NAMESPACE #define BOOST_UTF8_DECL #include <fstream> #include <boost/detail/utf8_codecvt_facet.hpp> #include <libs/detail/utf8_codecvt_facet.cpp> int main() { std::wstring str; std::wifstream bad("C:utf8.txt"); bad >> str; assert( str == L"äöü" ); // OOPS! :-( std::wifstream good("C:utf8.txt"); good.imbue(std::locale(std::locale(), new utf8_codecvt_facet)); good >> str; assert( str == L"äöü" ); // SUCCESS!! :-) }
    • 47. copyright 2006 David Abrahams, Eric Niebler01/30/15 58 Thanks!  Boost: http://boost.org  BoostCon: http://www.boostcon.org Questions?

    ×