Extracting Data from Text Documents using the Regex Class


Slide Deck from RegEx meets .Net at Silicon Valley Code Camp 2011

Published in: Technology
  1. Extracting Data from Text Documents using the Regex Class
     Steve Mylroie
  2. Bio: Steve Mylroie
     • Current Status
        • Semi-retired, 1099 Consultant (Microsoft Stack)
     • Baynet Roles:
        • Co-Chair South Bay Chapter, Treasurer, Board Member
     • Employment History (40+ years)
        • Semiconductor Industry
           • Signetics, NV Philips, Monolithic Memories, AMD, KLA-Tencor, Promise System (Samsung)
           • Process Development, TCAD, Metrology Tools, Factory Management Software, Shop Floor Control Systems
        • Medical Startups
           • QuickSilver Systems Lummisys (Ultrasound Image Management) (NT)
           • 5 Degree Bios (Cancer Treatment Planning) (DotNetNuke)
        • Education
           • BSEE, U of W; MS and PhD EE, Stanford
  3. Roadmap
     • The BAYNET Application (Problem Statement)
        • Email Delivery Error Reports (RFCs 821, 1893, 2034, 3464)
     • What is Regex
     • The .Net Implementation
     • Regex syntax (brief)
     • Code
        • SMTPExtendedStatusMessage
        • SMTPDiagnosticMessage
        • SMTPDeliveryErrorReport
        • Main
     • Demo
     • Regex vs String Class (Time Permitting)
  4. SMTP Email Delivery Failure Reports
     • Baynet was receiving around 1050 of these files with every meeting announcement posting.
     • Needed to automate the analysis
     • Needed to transfer the error information to a database for reporting, analysis and correction
  5. Email Error Reports
     • Returned in the body of a textual Delivery Status Notification prefixed to the original message, which is returned to the sender
     • RFC 821 Section 4: 3-digit reply codes, with 4XX and 5XX as error replies
     • RFC 1893, RFC 2034, RFC 3464 Extended Error Codes
        • Three dot-separated fields
           • Class field, one digit: 2, 4 or 5
           • Subject field, up to three digits
           • Detail field, up to three digits
           • C.SSS.DDD
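The code format described on this slide can be picked out of a report line with a single pattern. The following is a minimal sketch, not the code from the talk: the class name `StatusCodeSketch`, the group names, and the sample lines are all assumptions made for illustration. It treats the three-digit reply code as optional, so that both a `Status:` field and an SMTP diagnostic line are handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StatusCodeSketch
{
    // Hypothetical pattern: an optional 4XX/5XX reply code, then a C.SSS.DDD
    // extended code. The lookarounds keep it from matching inside longer
    // dotted numbers such as IP addresses.
    private static readonly Regex Status = new Regex(
        @"(?<![\d.])(?:(?<reply>[45]\d{2})[ -])?" +
        @"(?<class>[245])\.(?<subject>\d{1,3})\.(?<detail>\d{1,3})(?!\.?\d)");

    public static string Describe(string text)
    {
        Match m = Status.Match(text);
        if (!m.Success)
        {
            return "no status code found";
        }
        // A group that did not take part in the match has an empty Value.
        return string.Format("reply={0} class={1} subject={2} detail={3}",
            m.Groups["reply"].Value, m.Groups["class"].Value,
            m.Groups["subject"].Value, m.Groups["detail"].Value);
    }

    public static void Main()
    {
        // Sample lines are invented for illustration.
        Console.WriteLine(Describe("550 5.1.1 <nobody@example.com>: Recipient address rejected"));
        Console.WriteLine(Describe("Status: 4.2.2"));
    }
}
```

Named groups keep the extraction code readable when the fields are later copied into database columns.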
  6. What is Regex
     • Text parsing and text replacement utility based on a pattern-matching syntax
        • Originally a UNIX shell utility
        • Today there are UNIX and Linux shell utilities, C++ and Java libraries, JavaScript, PHP, Ruby, Python, Perl, PowerShell, VB 6, MySQL, Oracle, PostgreSQL, awk and VBScript implementations in addition to a .Net version
        • Web sites devoted to Regex
           • http://www.regular-expressions.info/
           • http://www.regexlib.com/
           • http://regexlib.com/CheatSheet.aspx
        • Regex documentation in Visual Studio Help
           • http://msdn.microsoft.com/en-us/library/az24scfc(VS.90).aspx
           • http://msdn.microsoft.com/en-us/library/az24scfc.aspx
  7. .Net Implementation
     • Class Regex (System.dll)
     • Namespace System.Text.RegularExpressions
     • Static and Dynamic Implementation
     • Static Implementation Public Methods
        • IsMatch, Match, Matches, Replace, Split, Escape, Unescape, CompileToAssembly
     • Static Implementation Public Property
        • CacheSize (Default size 15)
     • Versions
        • .Net 1.0+, Compact Framework 1.0+, Silverlight, XNA 1.0
        • (Perl 5 compatibility or ECMA compatibility)
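The static methods listed on this slide can be called without constructing a Regex object. The sketch below exercises most of them against one invented report line; the class name `StaticRegexSketch`, the helper names, and the patterns are assumptions for illustration, not the speaker's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StaticRegexSketch
{
    // Hypothetical sample line, invented for illustration.
    public const string Line = "550 5.1.1 <nobody@example.com>: Recipient address rejected";

    // IsMatch: does the line start with a three-digit reply code?
    public static bool StartsWithReplyCode(string s)
    {
        return Regex.IsMatch(s, @"^\d{3}[ -]");
    }

    // Match: first address in angle brackets, or null when there is none.
    public static string FirstAddress(string s)
    {
        Match m = Regex.Match(s, @"<([^<>\s]+@[^<>\s]+)>");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Matches: count every run of digits.
    public static int CountNumbers(string s)
    {
        return Regex.Matches(s, @"\d+").Count;
    }

    // Replace: mask anything in angle brackets.
    public static string HideAddresses(string s)
    {
        return Regex.Replace(s, @"<[^<>]+>", "<hidden>");
    }

    // Split: break on commas or semicolons with optional surrounding spaces.
    public static string[] SplitFields(string s)
    {
        return Regex.Split(s, @"\s*[,;]\s*");
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithReplyCode(Line));
        Console.WriteLine(FirstAddress(Line));
        Console.WriteLine(CountNumbers(Line));
        Console.WriteLine(HideAddresses(Line));
        Console.WriteLine(string.Join("|", SplitFields("a, b;c")));
        // Escape turns metacharacters into literals; Unescape reverses it.
        Console.WriteLine(Regex.Escape("5.1.1"));
        Console.WriteLine(Regex.Unescape(@"5\.1\.1"));
        // CacheSize limits how many patterns the static methods keep cached.
        Console.WriteLine(Regex.CacheSize);
    }
}
```

The static methods cache recently used patterns, which is what the CacheSize property controls.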
  8. .Net Implementation (Cont)
     • Constructors
        • Regex(String pattern)
        • Regex(String pattern, RegexOptions options)
     • Dynamic Version
        • Same methods, more parameter options
        • Two added properties (Options, RightToLeft)
        • Compiled but not cached by default
     • Most options can also be set using inline elements in the pattern string
     • BackReferences
     • System.Web.RegularExpressions
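A short sketch of the instance ("dynamic") form may help here: one Regex built with a RegexOptions argument and a backreference, and one that sets the same case-insensitive behaviour with an inline `(?i)` element. The class name `InstanceRegexSketch`, the field names, and the sample strings are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InstanceRegexSketch
{
    // Options passed to the constructor. \1 is a backreference to the first
    // group, so this pattern finds a word that is immediately repeated.
    public static readonly Regex RepeatedWord =
        new Regex(@"\b(\w+)\s+\1\b", RegexOptions.IgnoreCase);

    // The same case-insensitive option, set inline in the pattern string.
    public static readonly Regex StatusHeader =
        new Regex(@"(?i)^status:\s*(\S+)");

    public static void Main()
    {
        // The two properties that only the instance form exposes.
        Console.WriteLine(RepeatedWord.Options);
        Console.WriteLine(RepeatedWord.RightToLeft);

        // Sample strings are invented for illustration.
        Console.WriteLine(RepeatedWord.Match("delivery failed failed for user").Value);
        Console.WriteLine(StatusHeader.Match("STATUS: 5.1.1").Groups[1].Value);
    }
}
```

Holding the instance in a static readonly field means the pattern is parsed once and reused for every report file.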
  9. Other Objects In The Namespace
     • Match Object
     • Matches Collection
     • Group Object
     • Groups Collection
     • Captured Group Object
     • Capture Collection
     • Individual Capture
     • Regular Expression Engine
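These objects nest: a MatchCollection holds Match objects, each Match has a GroupCollection, and each Group has a CaptureCollection with one Capture per time the group matched. The sketch below walks that hierarchy; the class name `MatchObjectsSketch`, the pattern, and the sample text are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchObjectsSketch
{
    // Group 1 sits inside a {2} quantifier, so it captures twice per match.
    // Group 2 captures once.
    public const string Pattern = @"\b(?:(\d{1,3})\.){2}(\d{1,3})\b";

    public static List<string> Walk(string text)
    {
        List<string> lines = new List<string>();
        MatchCollection matches = Regex.Matches(text, Pattern);
        foreach (Match m in matches)
        {
            List<string> parts = new List<string>();
            Group repeated = m.Groups[1];
            // Group.Value holds only the last capture; Captures holds all of them.
            foreach (Capture c in repeated.Captures)
            {
                parts.Add(c.Value);
            }
            parts.Add(m.Groups[2].Value);
            lines.Add(m.Value + " -> " + string.Join(",", parts.ToArray()));
        }
        return lines;
    }

    public static void Main()
    {
        // Sample text is invented for illustration.
        foreach (string line in Walk("Status: 5.1.1 then 4.2.2"))
        {
            Console.WriteLine(line);
        }
    }
}
```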
  10. Regex Pattern Syntax
      • MSDN web page containing an enumeration of the regex expression syntax
      • http://msdn.microsoft.com/en-us/library/az24scfc.aspx
  11. Regex Syntax Example: U.S. Currency Validator
      Pattern as C# string: @"^\s*[+-]?\s?\$?\s?\d+(\.\d{2})?$"
      • ^ Start at the beginning of the string.
      • \s* Match zero or more white-space characters.
      • [+-]? Match zero or one occurrence of either the positive sign or the negative sign.
      • \s? Match zero or one white-space character.
      • \$? Match zero or one occurrence of the dollar sign.
      • \s? Match zero or one white-space character.
      • \d+ Match one or more decimal digits.
      • \. Match a decimal point symbol.
      • \d{2} Match two decimal digits.
      • (\.\d{2})? Match the decimal point and two fractional digits zero or one time.
      • $ Match the end of the string.
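To see the validator pattern at work, here is a minimal sketch. It assumes the slide's pattern is read with its backslashes restored and without the stray brace before the final `$`. The class name `CurrencySketch` and the sample inputs are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CurrencySketch
{
    // Assumed reading of the slide's pattern (backslashes restored).
    public const string Pattern = @"^\s*[+-]?\s?\$?\s?\d+(\.\d{2})?$";

    public static bool IsCurrency(string s)
    {
        return Regex.IsMatch(s, Pattern);
    }

    public static void Main()
    {
        // Sample inputs are invented for illustration.
        string[] samples = { "$19.99", "-$5", " + $ 3.00", "19.9", "abc" };
        foreach (string s in samples)
        {
            Console.WriteLine("'{0}' valid: {1}", s, IsCurrency(s));
        }
    }
}
```

Under this reading, a value with a single fractional digit such as "19.9" is rejected, because the fractional group requires exactly two digits after the decimal point.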
  12. Database Schema
  13. Code Example
      • Enough PowerPoint time for some code
  14. Demos
      • Demo (10 Error Reports)
      • Query Demos (Full DataSet)