© Prof Mukesh N Tekwani, 2016
Unit I Chap 3 : Python – Regular Expressions
3.1 Concept of Regular Expression
A regular expression is a special sequence of characters that helps you match or
find other strings or sets of strings, using a specialized syntax held in a pattern.
Regular expressions are widely used for text pattern matching, text extraction,
and search-and-replace operations. Regular expressions are also called REs,
regexes, or regex patterns.
The module re provides full support for regular expressions in Python. The re
module raises the exception re.error if an error occurs while compiling or using
a regular expression. We can specify the rules for the set of possible strings that
we want to match; this set might contain English sentences, or e-mail addresses, or
anything you like. You can then ask questions such as “Does this string match the
pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also
use REs to modify a string or to split it apart in various ways.
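As a minimal sketch of the error handling mentioned above, the snippet below tries to compile a deliberately malformed pattern and catches re.error. The helper name is_valid_pattern is hypothetical and is introduced only for this illustration.

```python
import re

def is_valid_pattern(pattern):
    """Return True if the pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid_pattern(r'aa[bc]*dd'))  # well-formed pattern
print(is_valid_pattern(r'aa[bc'))      # unterminated character set
```

Checking a pattern this way is useful when the pattern itself comes from user input.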
Patterns (regular expressions) are built from the following elements:
● Literal characters must match exactly. For example, "a" matches "a".
● Concatenated patterns match concatenated targets. For example, "ab"
("a" followed by "b") matches "ab".
● Alternate patterns (separated by a vertical bar) match either of the
alternative patterns. For example, "(aaa)|(bbb)" will match either "aaa"
or "bbb".
● Repeating and optional items:
○ "abc*" matches "ab" followed by zero or more occurrences of "c",
for example, "ab", "abc", "abcc", etc.
○ "abc+" matches "ab" followed by one or more occurrences of "c",
for example, "abc", "abcc", etc., but not "ab".
○ "abc?" matches "ab" followed by zero or one occurrence of "c",
that is, "ab" or "abc".
● Sets of characters -- Characters and sequences of characters in square
brackets form a set; a set matches any character in the set or range. For
example, "[abc]" matches "a" or "b" or "c". And, for example, "[_a-z0-9]"
matches an underscore or any lower-case letter or any digit.
● Groups -- Parentheses indicate a group within a pattern. For example,
"ab(cd)*ef" is a pattern that matches "ab" followed by any number of
occurrences of "cd" followed by "ef", for example, "abef", "abcdef",
"abcdcdef", etc.
● There are special names for some sets of characters, for example "\d"
(any digit), "\w" (any alphanumeric character), "\W" (any
non-alphanumeric character), etc.
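The rules listed above can be checked with a short script. This is only a sketch: it uses re.fullmatch(), which succeeds when the whole string matches the pattern, and the helper name check and the rejected sample strings are chosen here for illustration.

```python
import re

# Each entry: (pattern, strings that should match, strings that should not).
examples = [
    (r'ab',          ['ab'],                         ['a', 'ba']),
    (r'(aaa)|(bbb)', ['aaa', 'bbb'],                 ['aab']),
    (r'abc*',        ['ab', 'abc', 'abcc'],          ['a']),
    (r'abc+',        ['abc', 'abcc'],                ['ab']),
    (r'abc?',        ['ab', 'abc'],                  ['abcc']),
    (r'[_a-z0-9]',   ['_', 'q', '7'],                ['Q', '-']),
    (r'ab(cd)*ef',   ['abef', 'abcdef', 'abcdcdef'], ['abcef']),
]

def check(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

for pattern, good, bad in examples:
    accepted = all(check(pattern, s) for s in good)
    rejected = not any(check(pattern, s) for s in bad)
    print(pattern, '-> accepts', good, accepted, '| rejects', bad, rejected)
```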
3.2 Metacharacters
In forming a regular expression we use certain characters as metacharacters.
These characters do not match themselves; instead, they signal that something
special should be matched, or they change how other parts of the pattern behave.
The complete list of metacharacters is:
. ^ $ * + ? { } [ ] \ | ( )
Metacharacters [ and ] : They’re used for specifying a character class, which
is a set of characters that you wish to match. Characters can be listed
individually, or a range of characters can be indicated by giving two characters
and separating them by a '-'. For example, [abc] will match any of the
characters a, b, or c; this is the same as [a-c], which uses a range to express
the same set of characters.
If you want to match only lowercase letters, your RE would be [a-z].
If you want to match a digit from 2 to 7, the RE would be [2-7].
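A small sketch of these character classes in use is given below. It relies on re.findall(), a function of the re module not otherwise covered in these notes, which returns every non-overlapping match as a list; the sample strings are made up for the illustration.

```python
import re

# re.findall() returns every non-overlapping match as a list of strings.
listed = re.findall(r'[abc]', 'a1b2c3d4')        # characters listed one by one
ranged = re.findall(r'[a-c]', 'a1b2c3d4')        # the same set written as a range
lowercase = re.findall(r'[a-z]', 'Hello World')  # lowercase letters only
digits_2_to_7 = re.findall(r'[2-7]', '0123456789')

print(listed)         # ['a', 'b', 'c']
print(ranged)         # ['a', 'b', 'c']
print(lowercase)      # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
print(digits_2_to_7)  # ['2', '3', '4', '5', '6', '7']
```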
Metacharacter ^ : You can match the characters not listed within the class by
complementing the set. This is indicated by including a '^' as the first
character of the class. A '^' anywhere else inside the class has no special
meaning and simply matches the '^' character, while a '^' outside a character
class anchors the match to the start of the string.
For example, [^5] will match any character except '5'.
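The three roles of '^' in Python's re module can be sketched as follows: complementing a set when it comes first inside the brackets, matching a literal caret elsewhere inside the brackets, and anchoring to the start of the string outside the brackets. The variable names and sample strings are illustrative only.

```python
import re

not_five = re.findall(r'[^5]', '15253')     # '^' first: everything except '5'
caret_literal = re.findall(r'[5^]', '5^2')  # '^' not first: a literal caret
at_start = re.search(r'^ab', 'abc')         # outside a class: anchor at start
not_at_start = re.search(r'^ab', 'cab')     # 'ab' is present, but not at the start

print(not_five)        # ['1', '2', '3']
print(caret_literal)   # ['5', '^']
print(at_start is not None, not_at_start is None)  # True True
```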
Metacharacter \ : The backslash is one of the most important metacharacters. The
backslash can be followed by various characters to signal various special
sequences. It’s also used to escape all the metacharacters so you can still
match them in patterns.
If you need to match a [ or a \, you can precede it with a backslash to remove
its special meaning: \[ or \\. This will search for the [ character or the \
character.
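The sketch below shows escaped metacharacters in practice. The sample text is invented for the illustration, and re.escape() is a helper from the re module, not discussed in these notes, that inserts the backslashes for you.

```python
import re

text = r'path[0] = C:\temp'   # sample text containing '[' and a backslash

bracket = re.search(r'\[', text)     # \[ matches a literal '['
backslash = re.search(r'\\', text)   # \\ matches a literal backslash
print(bracket.start(), backslash.start())   # 4 12

# re.escape() returns the text with regex metacharacters escaped.
escaped = re.escape('a.b*c')
print(escaped)                                          # a\.b\*c
print(re.fullmatch(escaped, 'a.b*c') is not None)       # True
print(re.fullmatch(escaped, 'axbbc') is not None)       # False
```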
Metacharacter . :
The . matches anything except a newline character, and there’s an alternate
mode (re.DOTALL) where it will match even a newline. '.' is often used where
you want to match “any character”. Example: ‘x.x’ will match ‘xxx’ and also
‘xyx’.
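The behaviour of '.' with and without re.DOTALL can be sketched like this. The helper name whole_match is hypothetical; it wraps re.fullmatch() so that each result prints as True or False.

```python
import re

def whole_match(pattern, text, flags=0):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text, flags) is not None

print(whole_match(r'x.x', 'xxx'))                # True
print(whole_match(r'x.x', 'xyx'))                # True
print(whole_match(r'x.x', 'x\nx'))               # False: '.' does not match a newline
print(whole_match(r'x.x', 'x\nx', re.DOTALL))    # True: DOTALL lets '.' match a newline
```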
Special Sequences:
\d Matches any decimal digit; this is equivalent to the class [0-9].
\D Matches any non-digit character; this is equivalent to the class [^0-9].
\s Matches any whitespace character.
\S Matches any non-whitespace character.
\w Matches any alphanumeric character or underscore; this is equivalent to the
class [a-zA-Z0-9_].
\W Matches any character that is not alphanumeric or underscore; this is
equivalent to the class [^a-zA-Z0-9_].
In Python 3 these sequences are Unicode-aware for str patterns; the bracketed
classes above are exact equivalents when the re.ASCII flag is used.
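A short sketch of the special sequences applied to a made-up sample string follows. Each pattern is written as a raw string (prefix r) so that the backslash reaches the re module unchanged; re.findall() is used only to list the matches.

```python
import re

sample = 'Room 42, Block_B'   # invented sample text

digits = re.findall(r'\d', sample)        # single decimal digits
non_digits = re.findall(r'\D+', sample)   # runs of non-digit characters
spaces = re.findall(r'\s', sample)        # whitespace characters
non_spaces = re.findall(r'\S+', sample)   # runs of non-whitespace characters
words = re.findall(r'\w+', sample)        # runs of letters, digits, underscore
non_words = re.findall(r'\W+', sample)    # runs of everything else

print(digits)       # ['4', '2']
print(non_digits)   # ['Room ', ', Block_B']
print(spaces)       # [' ', ' ']
print(non_spaces)   # ['Room', '42,', 'Block_B']
print(words)        # ['Room', '42', 'Block_B']
print(non_words)    # [' ', ', ']
```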
3.3 The re package: search() and match()
The re module provides two functions to perform queries on an input string.
These functions are:
re.search() and re.match()
re.search() method:
Syntax of search():
re.search(pattern, string)
The description of parameters is as follows:
Parameter   Description
pattern     The regular expression to be matched.
string      The string that is searched; the pattern may match
            anywhere in the string.
The re.search function returns a match object on success, None on failure.
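As a minimal sketch of the return value, the snippet below calls re.search() once on a string that contains the pattern and once on a string that does not. The sample strings are invented; group() and span() are methods of the match object that report the matched text and its position.

```python
import re

found = re.search(r'aa[bc]*dd', 'xx aabcdd yy')   # pattern occurs in the middle
missing = re.search(r'aa[bc]*dd', 'abcd')         # pattern does not occur

if found:
    print(found.group(), found.span())   # aabcdd (3, 9)
print(missing)                           # None
```

Because None is treated as false, the result can be tested directly in an if statement.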
Program 1: RegEx1.py
Write a program to check whether the pattern 'aa[bc]*dd' appears in a line
entered by the user.
import re

# Compile the pattern once so it can be reused for every input line.
pat = re.compile(r'aa[bc]*dd')

while True:
    line = input('Enter a line ("q" to quit):')
    if line == 'q':
        break
    # search() looks for the pattern anywhere in the line.
    if pat.search(line):
        print('matched:', line)
    else:
        print('no match:', line)
Analysis:
1. We import module re in order to use regular expressions.
2. re.compile() compiles a regular expression so that we can reuse the
compiled regular expression without compiling it repeatedly.
Output:
Enter a line ("q" to quit):aabcdd
matched: aabcdd
Enter a line ("q" to quit):abcd
no match: abcd
Enter a line ("q" to quit):aacd
no match: aacd
Enter a line ("q" to quit):aadd
matched: aadd
Enter a line ("q" to quit):aabcbcdd
matched: aabcbcdd
Enter a line ("q" to quit):aabcdddd
matched: aabcdddd
Enter a line ("q" to quit):q
>>>
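The session above can also be reproduced without typing each line by hand. The following is a sketch, not part of the original program: it reuses one compiled pattern on a list holding the same sample inputs, and the names lines and results are chosen here for illustration.

```python
import re

# One compiled pattern, reused for every input.
pat = re.compile(r'aa[bc]*dd')

lines = ['aabcdd', 'abcd', 'aacd', 'aadd', 'aabcbcdd', 'aabcdddd']
results = {line: pat.search(line) is not None for line in lines}

for line, matched in results.items():
    print('matched:' if matched else 'no match:', line)
```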
Program 2: RegEx2.py
Write a program that searches for the occurrence of the pattern ‘A’ followed
by a single digit, followed by the pattern ‘bb’.
import re

# 'A', then exactly one digit, then 'bb'.
pat = re.compile(r'A[0-9]bb')

while True:
    line = input('Enter a line ("q" to quit):')
    if line == 'q':
        break
    # search() looks for the pattern anywhere in the line.
    if pat.search(line):
        print('matched:', line)
    else:
        print('no match:', line)
In the above program, search() scans the string from left to right and reports
the first place where the pattern matches; the match may begin anywhere in the string.
Output:
Enter a line ("q" to quit):A65b
no match: A65b
Enter a line ("q" to quit):A65bb
no match: A65bb
Enter a line ("q" to quit):A6bb
matched: A6bb
Enter a line ("q" to quit):AA6bb
matched: AA6bb
Enter a line ("q" to quit):AA6bbc
matched: AA6bbc
Enter a line ("q" to quit):q
>>>
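To see where in the string search() finds the pattern, the match object can be inspected. The sketch below runs the same pattern over the sample inputs shown above; the dictionary name positions is hypothetical, and span() returns the start and end index of the match.

```python
import re

pat = re.compile(r'A[0-9]bb')

# Map each input to the (start, end) of the first match, or None.
positions = {}
for line in ['A65b', 'A65bb', 'A6bb', 'AA6bb', 'AA6bbc']:
    m = pat.search(line)
    positions[line] = m.span() if m else None
    print(line, '->', positions[line])
```

For 'AA6bb' and 'AA6bbc' the match starts at index 1, which shows that search() does not require the pattern to begin at the start of the string.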
re.match() method:
Syntax of match():
re.match(pattern, string, flags=0)
The description of parameters is as follows:
Parameter   Description
pattern     The regular expression to be matched.
string      The string that is tested; the pattern must match at the
            beginning of the string.
flags       Optional. You can combine different flags using bitwise OR (|).
The re.match function returns a match object on success, None on failure.
We use the group(num) or groups() method of the match object to get the matched
text.
Method         Description
group(num=0)   Returns the entire match (or the specific subgroup num).
groups()       Returns all matching subgroups in a tuple.
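The sketch below pulls these points together: match() succeeds only at the beginning of the string while search() looks anywhere, group() and groups() retrieve the matched text, and flags are combined with |. The sample strings and the pattern with two groups are invented for the illustration.

```python
import re

# match() succeeds only when the pattern matches at the start of the string.
at_start = re.match(r'A[0-9]bb', 'A6bbc')
not_at_start = re.match(r'A[0-9]bb', 'AA6bb')
found_anywhere = re.search(r'A[0-9]bb', 'AA6bb')
print(at_start is not None, not_at_start is None, found_anywhere is not None)

# Parentheses create subgroups that group() and groups() can retrieve.
m = re.match(r'([a-z]+)([0-9]+)', 'unit3 regex')
print(m.group())    # unit3
print(m.group(1))   # unit
print(m.group(2))   # 3
print(m.groups())   # ('unit', '3')

# Flags are combined with bitwise OR: ignore case, and let '.' match a newline.
m2 = re.match(r'unit.chap', 'UNIT\nCHAP', re.IGNORECASE | re.DOTALL)
print(m2 is not None)   # True
```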
** Example of using an escape sequence: Start Python and type the
following two lines.
>>> name = 'Albert\nEinstein'
>>> print(name)
The output is shown below. Note that \n is the newline
character. It is treated as a single character, and it
causes the remaining part of the string to appear on the next line.
Output:
Albert
Einstein
** Raw Strings: Raw strings are strings in which backslash escape sequences
are not processed. We add the character ‘r’ or ‘R’ as a prefix to a string
literal to make it a raw string.
Modify the above example as follows:
>>> name = r'Albert\nEinstein'
>>> print (name)
Albert\nEinstein
>>>
Note that the \n sequence had no effect in this case: the backslash and the n
are kept as two ordinary characters.
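Raw strings matter for regular expressions because patterns use the backslash heavily. The sketch below compares the two literals from the example and then shows that a raw-string pattern is the same text as the doubled-backslash form; the sample text searched at the end is made up.

```python
import re

normal = 'Albert\nEinstein'   # \n becomes one newline character
raw = r'Albert\nEinstein'     # backslash and n stay as two characters
print(len(normal), len(raw))  # 15 16

# Both literals below denote the same three-character pattern: \d+
same_pattern = (r'\d+' == '\\d+')
print(same_pattern)           # True

year = re.search(r'\d+', 'Tekwani, 2016')
print(year.group())           # 2016
```

Writing patterns as raw strings avoids having to double every backslash.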
IMPORTANT QUESTIONS
1. What is a regular expression? Which module provides support for regular
expressions?
2. What is meant by the following: Literal characters, concatenated patterns,
alternate patterns, repeating and optional items, sets of characters.
3. What is a metacharacter? List the metacharacters used in Python. Explain
the following metacharacters: [ and ], ^, \ and .
4. With an example, explain the search and match methods.
5. What is a raw string? Explain with a simple example.
