Regexes in .NET

Regular Expressions (regex) language overview and the .NET framework.

Published in: Technology

  1. Regexes. SoftFluent day, 03/10/2013. Pablo Fernandez Duran
  2. Reg-what?
     • Regular expressions
     • Describing a search pattern
     • Find and replace operations
     • 1950
     • Regular language, formal language …
     • Different flavors -> PCRE (Perl Compatible Regular Expressions)
     • Now… not so regular
  3. regex, regexp, reg-exp, regexps, reg-exps, regexes, regexen
     ^reg-?ex(?(?<=-ex)p|p?)(?(?<=x)e[sn]|s)?$
     var re = new RegExp(/.*/); // js
     var re = new Regex(".*"); // .NET
  4. We have a problem. Let’s use regexes! Now we have two problems.
  5. What about you?
     • Can you read regexes? ^[0-9]\w*$
     • Can you really read regexes? ^[^)(]*((?>[^()]+|\((?<p>)|\)(?<-p>))*(?(p)(?!)))[^)(]*$
  6. Language overview
     • Character classes
        • \w (word characters), \d (decimal digits), \s (whitespace)
        • \W (not \w), \D (not \d), \S (not \s)
        • . (wildcard)
        • Character group: [abc]
        • Negation: [^a1]
        • Range: [C-F] or [2-6A-D]
        • Difference (subtraction): [A-Z-[B]]
     • Anchors
        • ^ (beginning of string or line), $ (end of string or line)
        • \b (word boundary), \B (not \b)
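The character classes and anchors above can be tried out with a few one-liners. This is only a sketch: the sample strings are made up for illustration, and it assumes C# 9 or later so that top-level statements compile.

```csharp
using System;
using System.Text.RegularExpressions;

// \d is a digit and \w a word character (letter, digit or underscore).
bool digitThenWord = Regex.IsMatch("9lives_ok", @"^\d\w*$");

// \s is whitespace; this sample string contains none.
bool hasWhitespace = Regex.IsMatch("no-spaces-here", @"\s");

// Character class subtraction: any of A-Z except B.
bool acceptsA = Regex.IsMatch("A", @"^[A-Z-[B]]$");
bool acceptsB = Regex.IsMatch("B", @"^[A-Z-[B]]$");

// \b is a word boundary, so the "cat" inside "concatenate" is skipped.
int wholeWordHits = Regex.Matches("cat concatenate cat.", @"\bcat\b").Count;

Console.WriteLine($"{digitThenWord} {hasWhitespace} {acceptsA} {acceptsB} {wholeWordHits}");
```

The verbatim string prefix (@"...") keeps the backslashes readable, which is probably the most common convention for regex literals in C#.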
  7. Language overview
     • Quantifiers
        • Range: {n,m}, {n,}
        • Zero or more: * (can be written {0,})
        • One or more: + (can be written {1,})
        • Zero or one: ? (can be written {0,1})
     • Greedy vs lazy
        • Greedy: match as many repetitions as possible (the default)
        • Lazy: match as few repetitions as possible
        • Lazy forms: *?, +?, ??, {n,m}?
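The difference between greedy and lazy quantifiers is easiest to see on the same input. A minimal sketch, with sample strings invented for the purpose (C# 9 or later assumed for top-level statements):

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ runs to the end and backs off only as far as the last '>'.
string greedy = Regex.Match(html, "<.+>").Value;

// Lazy: .+? stops at the first '>' that lets the match succeed.
string lazy = Regex.Match(html, "<.+?>").Value;

// The same idea applies to range quantifiers.
string greedyDigits = Regex.Match("2013-10-03", @"\d{2,4}").Value;
string lazyDigits = Regex.Match("2013-10-03", @"\d{2,4}?").Value;

Console.WriteLine(greedy);
Console.WriteLine(lazy);
Console.WriteLine($"{greedyDigits} vs {lazyDigits}");
```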
  8. Language overview
     • Grouping constructs
        • Capturing group: (subexpression)
        • Named group: (?<group_name>subexpression)
        • Non-capturing group: (?:subexpression)
        • Balancing groups: (?<name1-name2>subexpression)
     • Lookaround assertions (zero-length)
        • Positive lookahead: (?=subexpression)
        • Negative lookahead: (?!subexpression)
        • Positive lookbehind: (?<=subexpression)
        • Negative lookbehind: (?<!subexpression)
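As a rough illustration of the grouping and lookaround constructs, the sketch below reads named groups from a date and uses lookarounds to pick numbers by their context. The input strings and variable names are assumptions chosen for the example; C# 9 or later is assumed for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Named groups are read back by name instead of by position.
Match date = Regex.Match("SoftFluent day 03/10/2013",
    @"(?<day>\d{2})/(?<month>\d{2})/(?<year>\d{4})");
string year = date.Groups["year"].Value;

// (?:...) groups without capturing: only group 0 and (c) exist here.
int groupCount = Regex.Match("ababc", @"(?:ab)+(c)").Groups.Count;

const string text = "cost: $15, weight: 20, fee: $7";

// Joins the values of all matches of a pattern with commas.
string Join(string pattern) =>
    string.Join(",", Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value));

string labels = Join(@"\w+(?=:)");       // positive lookahead: words followed by ':'
string prices = Join(@"(?<=\$)\d+");     // positive lookbehind: digits preceded by '$'
string others = Join(@"(?<!\$)\b\d+");   // negative lookbehind: numbers not preceded by '$'

Console.WriteLine($"{year} {groupCount} [{labels}] [{prices}] [{others}]");
```

Lookarounds consume no characters, which is why the '$' and ':' never show up in the matched values.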
  9. Language overview
     • Backreference constructs
        • \groupnumber or \k<groupname>
     • Alternation constructs
        • (expression1|..|expressionn)
        • (?(expression)yes|no)
        • (?(referenced group)yes|no)
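Backreferences and conditional alternation are easier to follow with concrete inputs. The sketch below reuses the spelling pattern from slide 3 for the (?(expression)yes|no) form; the repeated-word and optional-quote patterns are illustrative assumptions, not taken from the slides. C# 9 or later is assumed for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Backreference: \k<word> must repeat exactly what (?<word>...) captured.
string repeated = Regex.Match("this is is a test", @"\b(?<word>\w+)\s+\k<word>\b").Value;

// (?(expression)yes|no): the slide 3 pattern, where lookbehinds act as conditions.
var spelling = new Regex(@"^reg-?ex(?(?<=-ex)p|p?)(?(?<=x)e[sn]|s)?$");
string[] accepted = { "regex", "regexp", "reg-exp", "regexps", "reg-exps", "regexes", "regexen" };
bool allAccepted = Array.TrueForAll(accepted, s => spelling.IsMatch(s));
bool acceptsRegDashEx = spelling.IsMatch("reg-ex");

// (?(group)yes|no): a closing quote is required only if an opening quote was captured.
var quoted = new Regex(@"^(?<q>"")?\w+(?(q)"")$");
bool bare = quoted.IsMatch("abc");
bool closed = quoted.IsMatch("\"abc\"");
bool unclosed = quoted.IsMatch("\"abc");

Console.WriteLine($"{repeated} | {allAccepted} {acceptsRegDashEx} | {bare} {closed} {unclosed}");
```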
  10. Format/Comment your code
      As you do when you write code…
          // an: assembly name, pn: pattern, n: type name, nn: namespace
          public static void C(string an, string pn, string n, string nn)
          {
              RegexCompilationInfo[] re = { new RegexCompilationInfo(pn, RegexOptions.Compiled, n, nn, true) };
              System.Reflection.AssemblyName asn = new System.Reflection.AssemblyName();
              asn.Name = an;
              Regex.CompileToAssembly(re, asn);
          }
      Regexes can have inline comments: (?#comment)
      They can also be written on multiple lines (don’t forget the IgnorePatternWhitespace option):
  11. Before:
          ^[^()]*((?<g>\()[^()]*)*((?<-g>\))[^()]*)*[^()]*(?(g)(?!))$
      After:
          ^                  #start
          [^()]*             #everything but ()
          (
            (?<g>\()         #opening group
            (
              [^()]*         #everything but ()
            )*
            (
              (?<-g>\))      #closing group
            )
            [^()]*           #everything but ()
          )*
          [^()]*             #everything but ()
          (?(g)              #if opening group remaining
          (?!))              #then make match fail
          $                  #end
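To show how a commented, multi-line pattern is actually passed to the engine, here is a sketch that checks for balanced parentheses with a balancing group. The pattern is a commonly used formulation written for this example rather than a copy of the one on the slide, and the group name "open" is an assumption. C# 9 or later is assumed for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// With IgnorePatternWhitespace, unescaped whitespace is ignored and '#'
// starts a comment that runs to the end of the line.
var balanced = new Regex(@"
    ^
    [^()]*                              # leading text without parentheses
    (?:
        (?: (?<open> \( ) [^()]* )+     # each '(' pushes a capture onto 'open'
        (?: (?<-open> \) ) [^()]* )+    # each ')' pops one; fails if none is left
    )*
    (?(open)(?!))                       # fail if any '(' is still unclosed
    $
    ", RegexOptions.IgnorePatternWhitespace);

foreach (string sample in new[] { "(a(b)c)", "(a)(b)", "no parens", "((a)", "a)b(" })
{
    Console.WriteLine($"{sample,-10} balanced: {balanced.IsMatch(sample)}");
}
```

Forgetting the option does not raise an error: the whitespace and comments are then treated as literal pattern text, so the regex silently stops matching what was intended.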
  12. In .NET / C#
      • A class to know: System.Text.RegularExpressions.Regex
      • Represents the regex engine
      • A pattern is tightly coupled to the regex engine
      • All regular expressions must be compiled (sooner or later)
      • Initialization can be an expensive process
  13. Regex options
      • None
      • IgnoreCase
      • Multiline
      • Singleline
      • ExplicitCapture
      • Compiled
      • IgnorePatternWhitespace
      • RightToLeft
      • ECMAScript
      • CultureInvariant
      http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx
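A few of these options change matching behaviour in ways that are easy to demonstrate. The sketch below covers IgnoreCase, Multiline, Singleline and ExplicitCapture on made-up inputs; it is not exhaustive, and it assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// IgnoreCase: letters match regardless of case.
bool caseInsensitive = Regex.IsMatch("REGEX", "regex", RegexOptions.IgnoreCase);

// Multiline: ^ and $ also match at line breaks, not only at the string ends.
int lineStarts = Regex.Matches("one\ntwo\nthree", @"^\w+", RegexOptions.Multiline).Count;
int stringStarts = Regex.Matches("one\ntwo\nthree", @"^\w+").Count;

// Singleline: '.' also matches the newline character.
bool dotCrossesLines = Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline);
bool dotByDefault = Regex.IsMatch("a\nb", "a.b");

// ExplicitCapture: unnamed parentheses no longer capture, named groups still do.
int groups = Regex.Match("2013-10", @"(\d+)-(?<month>\d+)", RegexOptions.ExplicitCapture)
                  .Groups.Count;

// Options are flags and can be combined with '|'.
bool combined = Regex.IsMatch("Line1\nEND", "^end$",
    RegexOptions.IgnoreCase | RegexOptions.Multiline);

Console.WriteLine($"{caseInsensitive} {lineStarts}/{stringStarts} " +
                  $"{dotCrossesLines}/{dotByDefault} {groups} {combined}");
```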
  14. Instance or static method calls?
      • Both provide the same matching/replacing methods
      • Static method calls cache the parsed patterns (15 entries by default)
      • Manage the cache size using Regex.CacheSize
      • Only static calls use caching (since .NET 2.0)
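The two calling styles can be sketched side by side. The e-mail pattern below is deliberately simplistic and only an assumption for the example, as are the sample addresses and the cache size of 50; C# 9 or later is assumed for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Deliberately naive pattern, for illustration only.
const string pattern = @"^[^@\s]+@[^@\s]+\.[^@\s]+$";
string[] emails = { "pablo@example.com", "not-an-email", "a@b.co" };

// Static call: the parsed pattern goes into the static cache and is reused
// on later static calls that use the same pattern and options.
Regex.CacheSize = 50;   // raise the limit if many distinct patterns are in play
int staticHits = 0;
foreach (string email in emails)
{
    if (Regex.IsMatch(email, pattern)) staticHits++;
}

// Instance call: the caller keeps the Regex object alive and reuses it,
// so nothing depends on the cache.
var emailRegex = new Regex(pattern);
int instanceHits = 0;
foreach (string email in emails)
{
    if (emailRegex.IsMatch(email)) instanceHits++;
}

Console.WriteLine($"static: {staticHits}, instance: {instanceHits}, cache size: {Regex.CacheSize}");
```

Creating a new Regex inside the loop would pay the construction cost on every iteration, which is the case the instance style has to avoid.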
  15. Instance or static method calls?
      • new Regex(pattern).IsMatch(email)
        vs.
      • Regex.IsMatch(email, pattern)
      Data from: http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx
  16. Interpreted or compiled
      • Interpreted:
         • Opcodes are created on initialization (static or instance).
         • Opcodes are converted to MSIL and executed by the JIT when the method is called.
         • Startup time is reduced, but execution is slower.
      • Compiled (RegexOptions.Compiled):
         • The regex is converted to MSIL code.
         • The MSIL code is executed by the JIT when the method is called.
         • Execution time is reduced, but startup is slower.
      • Compiled at design time:
         • Regex.CompileToAssembly
         • The regex is fixed and used only in instance calls.
         • Startup and execution time are both reduced at run time, but the compilation must be done at design time.
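The trade-off between the first two modes can be observed with a stopwatch. This is a rough sketch rather than a proper benchmark: the pattern, the input and the repetition count are arbitrary assumptions, and the timings will vary by machine and runtime. C# 9 or later is assumed for top-level statements.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

const string pattern = @"\b\w+@\w+\.\w+\b";
string input = string.Join(" ", Enumerable.Repeat("mail user@example.com now", 1000));

// Builds the regex with the given options and counts matches, timing both steps.
(int Count, TimeSpan Startup, TimeSpan Run) Measure(RegexOptions options)
{
    var clock = Stopwatch.StartNew();
    var regex = new Regex(pattern, options);   // construction cost
    regex.IsMatch("warm@up.call");             // first call, where compilation cost tends to surface
    TimeSpan startup = clock.Elapsed;

    clock.Restart();
    int count = regex.Matches(input).Count;    // steady-state matching cost
    return (count, startup, clock.Elapsed);
}

var interpreted = Measure(RegexOptions.None);
var compiled = Measure(RegexOptions.Compiled);

Console.WriteLine($"interpreted: {interpreted.Count} matches, startup {interpreted.Startup}, run {interpreted.Run}");
Console.WriteLine($"compiled:    {compiled.Count} matches, startup {compiled.Startup}, run {compiled.Run}");
```

Both modes must return the same matches; only the distribution of cost between startup and execution is expected to differ.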
  17. Interpreted or compiled
      Data from: http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx
  18. Tools
      • Regex Design
      • Expresso
      • The Regex Coach
      • RegexBuddy (not free)
      • Rex (Microsoft Research)
      • Visual Studio
  19. Bonus
      • Mail::RFC822::Address: regexp-based address validation
        http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
      • A regular expression to check for prime numbers: ^1?$|^(11+?)\1+$
        http://montreal.pm.org/tech/neil_kandalgaonkar.shtml
      • RegEx match open tags except XHTML self-contained tags (Stack Overflow)
        http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
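The prime-number trick works on numbers written in unary, as a run of 1s: the pattern matches exactly when that run can be split into two or more equal blocks of at least two 1s (or is shorter than two), so a match means "not prime". A small sketch, where the helper name IsPrime is an assumption for the example and C# 9 or later is assumed for top-level statements:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The regex matches 0, 1 and every composite number written in unary,
// so a number is prime when the pattern does NOT match.
// Only meant for small non-negative n; the backtracking grows quickly.
bool IsPrime(int n) => !Regex.IsMatch(new string('1', n), @"^1?$|^(11+?)\1+$");

string primes = string.Join(" ", Enumerable.Range(0, 30).Where(IsPrime));
Console.WriteLine(primes);
```

The backreference \1 is what takes this beyond a regular language: it forces every later block to repeat the first one exactly.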
  20. Regex optimization
      • Time out
      • Consider the input source
      • Capture only when necessary
      • Factorization
      • Backtracking
      “In general, a Nondeterministic Finite Automaton (NFA) engine like the .NET Framework regular expression engine places the responsibility for crafting efficient, fast regular expressions on the developer.”
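The time-out and backtracking points can be combined in one sketch. It uses the Regex constructor overload that takes a match time-out (available since .NET Framework 4.5) and a nested quantifier that is a classic source of excessive backtracking. The helper name TryIsMatch, the 200 ms limit and the input are assumptions for the example, and whether the slow pattern actually hits the limit depends on the runtime version. C# 9 or later is assumed for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Returns true or false for a completed match attempt, or null on time-out.
bool? TryIsMatch(string input, string pattern, TimeSpan timeout)
{
    try
    {
        return new Regex(pattern, RegexOptions.None, timeout).IsMatch(input);
    }
    catch (RegexMatchTimeoutException)
    {
        return null;   // the engine gave up instead of backtracking indefinitely
    }
}

TimeSpan limit = TimeSpan.FromMilliseconds(200);
string input = new string('a', 40) + "!";   // almost matches, which is the costly case

// Nested quantifiers: many ways to split the run of 'a' before failing on '!'.
bool? nested = TryIsMatch(input, @"^(a+)+$", limit);

// Factorized equivalent without the nested group or the capture.
bool? flat = TryIsMatch(input, @"^a+$", limit);

Console.WriteLine($"nested: {(nested.HasValue ? nested.ToString() : "timed out")}");
Console.WriteLine($"flat:   {(flat.HasValue ? flat.ToString() : "timed out")}");
```

A time-out is a safety net for untrusted input; rewriting the pattern so that it cannot backtrack excessively remains the actual fix.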
  21. Questions?
