Regular Expressions -- SAS and Perl

    Regular Expressions -- SAS and Perl - Presentation Transcript

    1. Regular Expressions – SAS® (RX) vs. Perl (PRX). Mark Tabladillo Ph.D., April 10, 2005. © 2005, markTab Consulting, All Rights Reserved
    2. Motivation: The SAS System Version 9 introduces Perl regular expressions (PRX). Earlier software versions already had SAS regular expressions (RX).
    3. Purpose: This presentation will compare and contrast the two types of regular expressions (RX and PRX) from both the functionality and performance viewpoints. The goal: Offer recommendations on when to use the two types. Application: Two generic examples will illustrate the recommended strategy.
    4. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy
    5. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy
    6. Vocabulary: Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step, as well as to make several substitutions in a string in one step. Regular expressions are a pattern language which provides fast tools for parsing large amounts of text. Metacharacters are special combinations of alphanumeric and/or symbolic characters which have specific meaning in defining a regular expression. Character classes are single or combinations of alphanumeric and/or symbolic characters which represent themselves.
    7. Is “One Step” Realistic? Practical uses of regular expressions use more than one step. Regular expressions provide a powerful, parsimonious syntax for string manipulation.
    8. When to Use Regular Expressions: Anything done in regular expressions could be coded another way. Many people do not use metacharacters in (for example) Google® searches. High-volume or complex string processing (such as in a data step) provides excellent potential.
    9. Why Regular Expressions can be Confusing: Regular expressions are a combination of:
        – Alphanumeric and/or symbolic characters representing themselves (character classes)
        – Special combinations of alphanumeric and/or symbolic characters (metacharacters) representing zero or more combinations of alphanumeric and/or symbolic characters
        – Specially flagged combinations of alphanumeric and/or symbolic characters which would normally be interpreted as metacharacters, but instead represent themselves (character classes)
    10. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy
    11. Similarity One: Parse Function. PARSE is the core function of creating a regular expression in memory using metacharacters, and assigning this regular expression to a numeric SAS variable, called the regular expression ID. The term ID refers to identification, and SAS will assign every PARSE function to a different and unique numeric value, and track those values automatically.
    12. Similarity One: Parse Function. The programming challenge is to create a regular expression which generically describes a character string pattern. Metacharacters for SAS (RX) and Perl (PRX) regular expressions are usually different, but either method can be used to create a similar if not identical result.
    13. Similarity One: Example. In this first example (SAS Institute, 2003), the goal is to find a pattern that matches (XXX) XXX-XXXX or XXX-XXX-XXXX for phone numbers in the United States.
        – The first three digits are the area code, and by standardized rules, the area code cannot start with a zero or a one.
        – The fourth through sixth digits are the prefix, and again by standard rules, the prefix also cannot start with a zero or one.
        – The suffix may have any digit, including zero or one, in any of the four places.
    14. Phone Number: Perl (PRX)
        paren = "\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d";
        dash = "[2-9]\d\d-[2-9]\d\d-\d\d\d\d";
        regexp = "/(" || paren || ")|(" || dash || ")/";
        See the paper for the full code and explanation.
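As a rough cross-check of the Perl-style pattern on slide 14, the same two sub-patterns can be tried in Python's `re` module, which accepts this subset of Perl syntax. This is only a sketch: it is not SAS code, the function name `is_phone` is made up for illustration, and the Perl slash delimiters are dropped because Python does not use them.

```python
import re

# Perl-style sub-patterns from the slide, written as Python raw strings.
paren = r"\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d"
dash = r"[2-9]\d\d-[2-9]\d\d-\d\d\d\d"

# Counterpart of "/(" || paren || ")|(" || dash || ")/" without the slashes.
# Compiling once plays the role the slides assign to the PARSE step.
PHONE = re.compile("(" + paren + ")|(" + dash + ")")


def is_phone(text):
    """Return True if text contains a phone number in either format."""
    return PHONE.search(text) is not None
```

Numbers whose area code or prefix starts with 0 or 1 are rejected by the `[2-9]` classes, which is the rule stated on slide 13.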
    15. Phone Number: SAS (RX)
        paren = "'('$'2-9'$d$d')'[' ']$'2-9'$d$d'-'$d$d$d$d";
        dash = "$'2-9'$d$d'-'$'2-9'$d$d'-'$d$d$d$d";
        regexp = paren || "|" || dash;
        See the paper for the full code and explanation.
    16. Comparing the Methods: A SAS macro was created to compare the methods. One iteration did not show a difference, so the iterations were increased to 500. SAS (RX) wins at 3.69 seconds compared to Perl (PRX) at 3.80 seconds. Point: If speed is an issue, you may try the two methods to see which wins.
    17. Similarity Two: Matching. The matching function uses the regular expression to determine a specific numeric position in a string. The return from a match function is a number representing a character position.
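The idea of a match function that returns a character position can be sketched in Python. This is an analogue, not SAS code: the helper name `match_position` is hypothetical, and the 1-based position with 0 for "no match" is chosen here to mirror how SAS counts characters.

```python
import re


def match_position(pattern, text):
    """Return the 1-based start of the first match, or 0 if there is none."""
    m = re.search(pattern, text)
    # Python positions are 0-based, so shift by one for a 1-based result.
    return m.start() + 1 if m else 0
```

For example, `match_position(r"\d+", "abc123")` reports the position of the first digit.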
    18. Similarity Three: Substring. The substring routine allows for inputting a regular expression and string, and outputting a position and length. Routines (unlike functions) can have variable numbers of inputs and outputs, as in the substring routine.
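A substring-style lookup that yields both a position and a length might look like the following Python sketch. The name `match_span` and the `(0, 0)` result for "no match" are assumptions made for illustration; Python returns both values as a tuple, whereas a SAS routine would write them into output arguments.

```python
import re


def match_span(pattern, text):
    """Return (position, length) of the first match, 1-based, or (0, 0)."""
    m = re.search(pattern, text)
    if m is None:
        return (0, 0)
    # start() is 0-based; the length is the distance between end and start.
    return (m.start() + 1, m.end() - m.start())
```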
    19. Similarity Four: Change. The change routine allows for inputting a regular expression, a maximum number of times to replace, and an old string, and outputs a new string. Both SAS (RX) and Perl (PRX) allow for changing a string in place.
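The change operation, with its cap on the number of replacements, corresponds loosely to `re.sub` with a `count` argument in Python. The wrapper name `change` is hypothetical, and note one difference in convention that is specific to Python: `count=0` means "replace every match".

```python
import re


def change(pattern, replacement, text, max_times=0):
    """Replace up to max_times matches; 0 means replace all (Python's rule)."""
    return re.sub(pattern, replacement, text, count=max_times)
```

Python strings are immutable, so this sketch returns a new string rather than changing the input in place.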
    20. Similarity Five: Free. The free routine releases the memory allocation for the regular expression. It is recommended to always include a FREE routine to prevent problems.
    21. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy
    22. Capture Buffers: Perl (PRX) regular expressions can use capture buffers, defined as part of a match explicitly specified in the Perl regular expression. The capture buffers are collectively a one-dimensional numbered array of results (starting at one, not zero). Example: Parts of a phone number. More than one step is required.
    23. Unique Feature One: PRXPOSN Routine. The PRXPOSN routine finds the start position and length of a numbered capture buffer.
    24. Unique Feature Two: PRXPOSN Function. The PRXPOSN function uses the positional capture buffer number to return the actual string in the capture buffer. This function is probably more useful than the PRXPOSN routine.
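Capture buffers correspond to numbered groups in Python's `re` module, which are also counted from one. The sketch below splits a parenthesized phone number into its parts, returning both the captured text (the job the slides give to the PRXPOSN function) and a position and length for the first buffer (the job of the PRXPOSN routine). The pattern with three groups and the name `phone_parts` are illustrative assumptions, not taken from the paper.

```python
import re

# Three capture groups: area code, prefix, suffix.
PHONE_PARTS = re.compile(r"\(([2-9]\d\d)\) ?([2-9]\d\d)-(\d\d\d\d)")


def phone_parts(text):
    """Return the captured parts of the first phone number, or None."""
    m = PHONE_PARTS.search(text)
    if m is None:
        return None
    return {
        "area": m.group(1),    # text of capture group 1
        "prefix": m.group(2),  # text of capture group 2
        "suffix": m.group(3),  # text of capture group 3
        # 1-based start position and length of capture group 1
        "area_pos_len": (m.start(1) + 1, m.end(1) - m.start(1)),
    }
```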
    25. Unique Feature Three: PRXPAREN. The PRXPAREN function assumes that the capture buffer was an ordered hierarchical array, and will return the highest non-missing capture buffer number. See the paper for an example.
    26. Unique Feature Four: PRXNEXT. Similar to PRXMATCH, the PRXNEXT routine will iteratively search a string for matches. Not based on the capture buffer. Useful when a string can have multiple, even overlapping, matches.
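Iterative searching of this kind can be approximated in Python by restarting the search from where the previous match ended. The following is a sketch under stated assumptions: `all_matches` is a made-up helper, positions are reported 1-based, and the loop resumes after each match, so it finds successive non-overlapping matches only.

```python
import re


def all_matches(pattern, text):
    """Return (position, length) pairs for successive matches, 1-based."""
    compiled = re.compile(pattern)
    results = []
    start = 0
    while start <= len(text):
        m = compiled.search(text, start)
        if m is None:
            break
        results.append((m.start() + 1, m.end() - m.start()))
        # Resume after this match; step one character past an empty match
        # so the loop always makes progress.
        start = m.end() if m.end() > m.start() else m.end() + 1
    return results
```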
    27. Unique Feature Five: PRXDEBUG. The PRXDEBUG routine writes debugging messages to the log. Provides insight into how regular expression functions and routines search through specific strings. Debugging works best when smaller pieces are checked first, building toward the whole regular expression.
    28. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy
    29. Recommended Strategy: Use the type which has the desired functionality. If you don’t know either, start with Perl regular expressions (PRX). If you are looking at performance or speed issues, try tests both ways (RX and PRX).
    30. Outline: Background; Similarities between SAS (RX) and Perl Regular Expressions (PRX); Unique Perl Regular Expression (PRX) Capabilities; Recommended Strategy for SAS (RX) and Perl Regular Expressions (PRX); Two Examples of Recommended Strategy
    31. Example One: Printer Names. The Universal Naming Convention describes printers as: \\computer_name\printer_shared_name. The SYSPRINT option returns or sets the UNC printer name.
    32. Example One: Printer Name. Problem: A variety of legal UNC formats:
        – \\computer_name\printer_shared_name
        – (\\computer_name\printer_shared_name)
        – (“\\computer_name\printer_shared_name’)
        12 printers * 3 formats = 36 combinations. SAS (RX) could be used with 3 separate regular expressions. Perl (PRX) capture buffer used.
    33. Example One: PRX
        '/(\\\\[-\\\w]+|[-\w]+)/'
        The regular expression will extract the printer name, without the braces, brackets, or quotation marks. See the paper for explanation.
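To see how a single capture group can pull the printer name out of any of the three UNC formats, here is a Python sketch. The pattern is one reading of the expression on slide 33 (two literal backslashes followed by hyphens, backslashes, and word characters, or else a plain name), so treat it as an assumption rather than the paper's exact code. The name `printer_name` and the sample server and printer names in the usage note are invented.

```python
import re

# First alternative: a UNC name starting with two backslashes.
# Second alternative: a bare name made of hyphens and word characters.
PRINTER = re.compile(r"(\\\\[-\\\w]+|[-\w]+)")


def printer_name(sysprint_value):
    """Return the printer name without surrounding brackets or quotes."""
    m = PRINTER.search(sysprint_value)
    return m.group(1) if m else None
```

Calling `printer_name` on `\\server1\laser_2`, on the same text wrapped in parentheses, or on the parenthesized and quoted form returns the same bare UNC name, because parentheses and quotation marks are not in either character class.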
    34. Example Two: Windows Subdirectory. Get the subdirectory from the longer string which started with the drive name and ended with a specific filename:
        – X:\\Sub_Directory_1\Sub_Directory_2\...\Sub_Directory_N\Filename.Extension
        As in the previous example, the original string includes the backslash, which is a Perl delimiting metacharacter.
    35. Example Two: Regular Expression
        '/([A-Za-z]:[. -\\\w]+)\\([. -\w]+)\\([. -\w]+)/'
        The regular expression creates three capture buffers, with the second capture buffer containing the string of interest. See the paper for a full explanation.
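The three-buffer idea can be tried out in Python with a pattern in the spirit of slide 35. This is a sketch, not the paper's expression: the character classes are rewritten with the hyphen first (`[-. \w]`) because Python rejects a hyphen placed directly before `\w`, and the name `last_subdirectory` and the sample path in the usage note are invented.

```python
import re

# Group 1: drive letter plus leading path (may contain backslashes).
# Group 2: the final subdirectory (no backslashes allowed).
# Group 3: the file name and extension.
PATH = re.compile(r"([A-Za-z]:[-. \\\w]+)\\([-. \w]+)\\([-. \w]+)")


def last_subdirectory(path):
    """Return the subdirectory that directly contains the file, or None."""
    m = PATH.search(path)
    return m.group(2) if m else None
```

Because group 1 is greedy and may contain backslashes, it absorbs every directory except the last one, which leaves the final subdirectory in group 2. For a path such as `X:\Reports\2005\April\summary.txt`, group 2 holds `April`.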
    36. Conclusion: With version 9, SAS programmers have two regular expression choices: SAS (RX) and Perl (PRX). The presentation described similarities and differences, and offered a recommended strategy. The paper contains three detailed examples, and an annotated bibliography.
