Regular Expressions -- SAS and Perl

The SAS System provides two declarative syntax languages for regular expressions: SAS and Perl. This presentation compares and contrasts these two complementary choices for SAS application developers.


  1. Regular Expressions – SAS® (RX) vs. Perl (PRX). Mark Tabladillo, Ph.D. April 10, 2005. © 2005, markTab Consulting, All Rights Reserved
  2. Motivation: The SAS System Version 9 introduces Perl regular expressions (PRX). Earlier software versions already had SAS regular expressions (RX).
  3. Purpose: This presentation will compare and contrast the two types of regular expressions (RX and PRX) from both the functionality and performance viewpoints. The goal: offer recommendations on when to use the two types. Application: two generic examples will illustrate the recommended strategy.
  4. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy.
  5. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy.
  6. Vocabulary: Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step, as well as to make several substitutions in a string in one step. Regular expressions are a pattern language which provides fast tools for parsing large amounts of text. Metacharacters are special combinations of alphanumeric and/or symbolic characters which have specific meaning in defining a regular expression. Character classes are single or combinations of alphanumeric and/or symbolic characters which represent themselves.
  7. Is “One Step” Realistic? Practical uses of regular expressions use more than one step. Regular expressions provide a powerful, parsimonious syntax for string manipulation.
  8. When to Use Regular Expressions: Anything done in regular expressions could be coded another way. Many people do not use metacharacters in (for example) Google® searches. High-volume or complex string processing (such as in a data step) provides excellent potential.
  9. Why Regular Expressions Can Be Confusing: Regular expressions are a combination of: – Alphanumeric and/or symbolic characters representing themselves (character classes) – Special combinations of alphanumeric and/or symbolic characters (metacharacters) representing zero or more combinations of alphanumeric and/or symbolic characters – Specially flagged combinations of alphanumeric and/or symbolic characters which would normally be interpreted as metacharacters, but instead represent themselves (character classes)
  10. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy.
  11. Similarity One: Parse Function. PARSE is the core function of creating a regular expression in memory using metacharacters, and assigning this regular expression to a numeric SAS variable, called the regular expression ID. The term ID refers to identification, and SAS will assign every PARSE function to a different and unique numeric value, and track those values automatically.
  12. Similarity One: Parse Function. The programming challenge is to create a regular expression which generically describes a character string pattern. Metacharacters for SAS (RX) and Perl (PRX) regular expressions are usually different, but either method can be used to create a similar if not identical result.
  13. Similarity One: Example. In this first example (SAS Institute, 2003), the goal is to find a pattern that matches (XXX) XXX-XXXX or XXX-XXX-XXXX for phone numbers in the United States. – The first three digits are the area code, and by standardized rules, the area code cannot start with a zero or a one. – The fourth through sixth digits are the prefix, and again by standard rules, the prefix also cannot start with a zero or one. – The suffix may have any digit, including zero or one, in any of the four places.
  14. Phone Number: Perl (PRX). paren = "\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d"; dash = "[2-9]\d\d-[2-9]\d\d-\d\d\d\d"; regexp = "/(" || paren || ")|(" || dash || ")/"; See the paper for the full code and explanation.
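
The three assignments on the slide can be placed in a complete DATA step along the following lines. This is a minimal sketch, not the paper's full code: the variable names `re` and `phone`, the sample phone numbers, and the log messages are assumptions made for illustration.

```sas
data _null_;
   /* Compile the pattern once, on the first iteration, and retain its ID. */
   if _n_ = 1 then do;
      paren  = "\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d";
      dash   = "[2-9]\d\d-[2-9]\d\d-\d\d\d\d";
      regexp = "/(" || paren || ")|(" || dash || ")/";
      retain re;
      re = prxparse(regexp);
      if missing(re) then do;
         putlog "ERROR: Invalid regexp " regexp;
         stop;
      end;
   end;

   /* Sample data (assumed): read the first 16 columns, blanks included. */
   length phone $ 16;
   input phone $char16.;

   /* PRXMATCH returns 0 when the pattern is not found. */
   if prxmatch(re, phone) then putlog "NOTE: Valid   " phone;
   else putlog "NOTE: Invalid " phone;
   datalines;
(919) 319-1677
800-899-2164
123-456-7890
;
run;
```

The pattern is not anchored, so a valid number embedded in a longer string would also count as a match.
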
  15. Phone Number: SAS (RX). paren = "'('$'2-9'$d$d')'[' ']$'2-9'$d$d'-'$d$d$d$d"; dash = "$'2-9'$d$d'-'$'2-9'$d$d'-'$d$d$d$d"; regexp = paren || "|" || dash; See the paper for the full code and explanation.
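
The RX assignments fit into the same kind of DATA step, with RXPARSE and RXMATCH taking the place of their PRX counterparts. Again a hedged sketch: the variable names, sample values, and log text are assumptions, and the pattern strings are the ones shown on the slide.

```sas
data _null_;
   /* Compile the RX pattern once and retain its ID. */
   if _n_ = 1 then do;
      paren  = "'('$'2-9'$d$d')'[' ']$'2-9'$d$d'-'$d$d$d$d";
      dash   = "$'2-9'$d$d'-'$'2-9'$d$d'-'$d$d$d$d";
      regexp = paren || "|" || dash;
      retain rx;
      rx = rxparse(regexp);
   end;

   /* Sample data (assumed). */
   length phone $ 16;
   input phone $char16.;

   /* RXMATCH returns the position of the match, or 0 if there is none. */
   if rxmatch(rx, phone) then putlog "NOTE: Valid   " phone;
   else putlog "NOTE: Invalid " phone;
   datalines;
(919) 319-1677
800-899-2164
;
run;
```
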
  16. Comparing the Methods: A SAS macro was created to compare the methods. One iteration did not show a difference, so the iterations were increased to 500. SAS (RX) wins at 3.69 seconds compared to Perl (PRX) at 3.80 seconds. Point: if speed is an issue, you may try the two methods to see which wins.
  17. Similarity Two: Matching. The matching function uses the regular expression to determine a specific numeric position in a string. The return from a match function is a number representing a character position.
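
A small sketch of the position returned by a match function, here PRXMATCH. The pattern and the test strings are assumptions chosen for illustration.

```sas
data _null_;
   re = prxparse('/dog/');

   /* Position of the first character of the match within the string. */
   pos  = prxmatch(re, 'The cat and the dog');

   /* A return value of 0 means the pattern was not found. */
   none = prxmatch(re, 'No match here');

   put pos= none=;
   call prxfree(re);
run;
```

Because 0 signals no match, the return value can be used directly in an IF condition.
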
  18. Similarity Three: Substring. The substring routine allows for inputting a regular expression and string, and outputting a position and length. Routines (unlike functions) can have variable numbers of inputs and outputs, as in the substring routine.
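
A sketch of the substring routine on the PRX side, CALL PRXSUBSTR, which fills in both a position and a length. The pattern, the text, and the variable names are assumptions.

```sas
data _null_;
   re   = prxparse('/\d{3}-\d{4}/');
   text = 'Call 555-1234 today';

   /* The routine writes its two outputs into POSITION and LENGTH. */
   call prxsubstr(re, text, position, length);

   /* POSITION is 0 when nothing matched, so guard the SUBSTR call. */
   if position > 0 then do;
      found = substr(text, position, length);
      put position= length= found=;
   end;
   call prxfree(re);
run;
```
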
  19. Similarity Four: Change. The change routine allows for inputting a regular expression, a maximum number of times to replace, and an old string, and outputs a new string. Both SAS (RX) and Perl (PRX) allow for changing a string in place.
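
A sketch of an in-place change with CALL PRXCHANGE, using a Perl substitution expression. The strings are assumptions; only the PRX side is shown.

```sas
data _null_;
   length text $ 40;
   /* s/old/new/ is the Perl substitution form that PRXPARSE accepts. */
   re = prxparse('s/cat/dog/');

   text = 'cat, cat, and cat';
   call prxchange(re, 1, text);     /* replace at most one match, in place */
   put text=;

   text = 'cat, cat, and cat';
   call prxchange(re, -1, text);    /* -1 replaces every match */
   put text=;

   call prxfree(re);
run;
```

Supplying a fourth argument to CALL PRXCHANGE writes the result to a separate variable instead of overwriting the original string.
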
  20. Similarity Five: Free. The free routine releases the memory allocation for the regular expression. It is recommended to always include a FREE routine to prevent problems.
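
One common arrangement, sketched here under assumptions, is to compile on the first observation, reuse the ID for every row, and free it on the last. The SASHELP.CLASS sample table and the pattern are chosen only for illustration.

```sas
data _null_;
   set sashelp.class end=last;
   retain re;

   /* Compile once rather than once per observation. */
   if _n_ = 1 then re = prxparse('/^J/');

   if prxmatch(re, name) then put name=;

   /* Release the compiled pattern after the final observation. */
   if last then call prxfree(re);
run;
```
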
  21. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy.
  22. Capture Buffers: Perl (PRX) regular expressions can use capture buffers, defined as part of a match explicitly specified in the Perl regular expression. The capture buffers are collectively a one-dimensional numbered array of results (starting at one, not zero). Example: parts of a phone number. More than one step is required.
  23. Unique Feature One: PRXPOSN Routine. The PRXPOSN routine finds the start position and length of a numbered capture buffer.
  24. Unique Feature Two: PRXPOSN Function. The PRXPOSN function uses the positional capture buffer number to return the actual string in the capture buffer. This function is probably more useful than the PRXPOSN routine.
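
The phone-number idea can be sketched with three capture buffers, showing both the PRXPOSN routine and the PRXPOSN function after a match. The pattern, the sample number, and the variable names are assumptions.

```sas
data _null_;
   length area prefix suffix $ 4;

   /* Each unescaped pair of parentheses defines one capture buffer. */
   re    = prxparse('/\((\d{3})\) ?(\d{3})-(\d{4})/');
   phone = '(919) 319-1677';

   /* Step 1: run the match so that the capture buffers are filled. */
   if prxmatch(re, phone) then do;

      /* Step 2a: the routine reports where buffer 2 sits in the source. */
      call prxposn(re, 2, start, len);

      /* Step 2b: the function returns the text of each buffer directly. */
      area   = prxposn(re, 1, phone);
      prefix = prxposn(re, 2, phone);
      suffix = prxposn(re, 3, phone);

      put start= len= area= prefix= suffix=;
   end;
   call prxfree(re);
run;
```

The match has to run first, which is the "more than one step" noted on the capture buffer slide.
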
  25. Unique Feature Three: PRXPAREN. The PRXPAREN function assumes that the capture buffer was an ordered hierarchical array, and will return the highest non-missing capture buffer number. See the paper for an example.
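
A sketch of PRXPAREN with three alternatives, each in its own capture buffer. The words and the test string are assumptions; this is not the paper's example.

```sas
data _null_;
   /* Buffers 1, 2, and 3 correspond to the three alternatives. */
   re = prxparse('/(magazine)|(book)|(newspaper)/');

   position = prxmatch(re, 'find book here');

   /* PRXPAREN reports the number of the last buffer that took part in the match. */
   if position > 0 then do;
      paren = prxparen(re);
      put position= paren=;
   end;
   call prxfree(re);
run;
```
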
  26. Unique Feature Four: PRXNEXT. Similar to PRXMATCH, the PRXNEXT routine will iteratively search a string for matches. Not based on the capture buffer. Useful when a string can have multiple, even overlapping, matches.
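
A sketch of the iterative search with CALL PRXNEXT, which is called in a loop until no further match is reported. The pattern and the text are assumptions.

```sas
data _null_;
   re    = prxparse('/[crb]at/');
   text  = 'The woods have a bat, cat, and a rat!';

   /* START and STOP bound the search; the routine moves START forward
      after each match so that the next call continues from there. */
   start = 1;
   stop  = length(text);

   call prxnext(re, start, stop, text, position, length);
   do while (position > 0);
      found = substr(text, position, length);
      put found= position= length=;
      call prxnext(re, start, stop, text, position, length);
   end;
   call prxfree(re);
run;
```
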
  27. Unique Feature Five: PRXDEBUG. The PRXDEBUG routine writes debugging messages to the log. Provides insight into how regular expression functions and routines search through specific strings. Debugging works best when smaller pieces are checked first, building toward the whole regular expression.
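
A sketch of switching the debugging output on around one small pattern. The pattern and the string are assumptions; the messages themselves appear in the SAS log.

```sas
data _null_;
   /* A nonzero argument turns debugging output on. */
   call prxdebug(1);

   /* Compilation and matching are now traced in the log. */
   re  = prxparse('/[bc]at/');
   pos = prxmatch(re, 'the cat');

   /* Zero turns debugging output off again. */
   call prxdebug(0);
   call prxfree(re);
run;
```
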
  28. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy.
  29. Recommended Strategy: Use the type which has the desired functionality. If you don’t know either, start with Perl regular expressions (PRX). If you are looking at performance or speed issues, try tests both ways (RX and PRX).
  30. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy.
  31. Example One: Printer Names. The Universal Naming Convention describes printers as: \\computer_name\printer_shared_name. The SYSPRINT option returns or sets the UNC printer name.
  32. Example One: Printer Name. Problem: A variety of legal UNC formats: – \\computer_name\printer_shared_name – (\\computer_name\printer_shared_name) – (“\\computer_name\printer_shared_name”). 12 printers * 3 formats = 36 combinations. SAS (RX) could be used with 3 separate regular expressions. Perl (PRX) capture buffer used.
  33. Example One: PRX. '/([-w]+|[-w]+)/' The regular expression will extract the printer name without the braces, or brackets, or quotation marks. See the paper for explanation.
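
The capture-buffer approach for printer names might look like the sketch below. The pattern is an assumption written for this illustration rather than the slide's own expression, and the server and printer names are hypothetical.

```sas
data _null_;
   length printer $ 60;

   /* Assumed pattern: two backslashes, a name, one backslash, a name.
      Names may contain word characters and hyphens. */
   re = prxparse('/(\\\\[-\w]+\\[-\w]+)/');

   /* Hypothetical value wrapped in parentheses and quotation marks. */
   raw = '("\\server01\laser-2")';

   /* Buffer 1 holds the UNC name without the surrounding punctuation. */
   if prxmatch(re, raw) then printer = prxposn(re, 1, raw);
   put printer=;

   call prxfree(re);
run;
```

Because the wrapping characters fall outside the capture buffer, one expression covers all three formats.
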
  34. Example Two: Windows Subdirectory. Get the subdirectory from the longer string which started with the drive name and ended with a specific filename: – X:\Sub_Directory_1\Sub_Directory_2\...\Sub_Directory_N\Filename.Extension. As in the previous example, the original string includes the backslash, which is a Perl delimiting metacharacter.
  35. Example Two: Regular Expression. '/([A-Za-z]:[. -w]+)([. -w]+)([. -w]+)/' The regular expression creates three capture buffers, with the second capture buffer containing the string of interest. See the paper for a full explanation.
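
A three-buffer arrangement of this kind can be sketched as follows. The pattern is an assumption, not the slide's expression: it treats the last subdirectory before the filename as the string of interest, and the path is hypothetical.

```sas
data _null_;
   length subdir $ 60;

   /* Assumed pattern:
      buffer 1 = drive letter plus the leading part of the path,
      buffer 2 = the last subdirectory,
      buffer 3 = the filename.
      Each literal backslash is written as \\ inside the expression. */
   re = prxparse('/([A-Za-z]:.*)\\([^\\]+)\\([^\\]+)/');

   /* Hypothetical path. */
   path = 'C:\Projects\Reports\2005\summary.txt';

   if prxmatch(re, path) then subdir = prxposn(re, 2, path);
   put subdir=;

   call prxfree(re);
run;
```

The greedy `.*` in the first buffer takes as much of the path as it can while still leaving two backslash-separated pieces for buffers 2 and 3.
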
  36. Conclusion: With version 9, SAS programmers have two regular expression choices: SAS (RX) and Perl (PRX). The presentation described similarities and differences, and offered a recommended strategy. The paper contains three detailed examples, and an annotated bibliography.
