Regex lecture



PT. BUZOO INDONESIA is the No. 1 Japanese offshore development company in Indonesia.
We are professionals in web solutions and smartphone apps. We can support Japanese, English and Indonesian.

Published in: Technology


  2. 2. What are Regular Expressions?  Regular expressions, or regex (the term mostly used in this presentation), are a powerful tool for examining and modifying text.  Regex uses a general pattern notation that lets you describe and parse text.  PHP has supported two different types of regular expressions: POSIX-extended and Perl-Compatible Regular Expressions (PCRE). The POSIX functions were deprecated in PHP 5.3 and removed in PHP 7, so this lecture focuses on PCRE.
  3. 3. Delimiters  When using PCRE functions, the pattern must be enclosed in delimiters.  Commonly used delimiters are forward slashes (/), hash signs (#) and tildes (~).  If the delimiter character appears inside the pattern, it must be escaped with a backslash.  Examples of usage:  /([^\/ | ^-]+)\.html/  /<\/span>(.*?)<\/span>/
  4. 4. Literal Characters  Literal characters are normal characters that match themselves. Alphanumeric characters and symbols with no special meaning in a pattern are examples of literal characters.  To match a meta-character literally, add a backslash (\) before it. The backslash tells the engine to treat that character as a literal character, not as a meta-character.
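The effect of escaping can be sketched as follows. This is a minimal illustration that assumes Python's `re` module as a stand-in for PHP's PCRE functions (Python patterns are plain strings, so there are no delimiters); the sample strings are made up for the example.

```python
import re

# An unescaped dot is a meta-character: it matches any character except a newline.
unescaped = re.compile(r"index.html")
# A backslash before the dot makes it a literal dot.
escaped = re.compile(r"index\.html")

print(bool(unescaped.search("indexZhtml")))  # True: the dot matched "Z"
print(bool(escaped.search("indexZhtml")))    # False: a real dot is required
print(bool(escaped.search("index.html")))    # True
```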
  5. 5. Meta-characters  Meta-characters are the main power of regular expressions. With meta-characters it is possible to encode alternatives and repetitions in the pattern.  Meta-characters are divided into two types: those recognized outside a character class and those recognized inside a character class.
  6. 6. Meta-characters Cont’d  Here is the list of meta-characters that work outside a class:  \ , ^ , $ , . , [ , ] , | , ( , ) , ? , * , + , { , }  And this is the list of meta-characters that work inside a class:  \ , ^ , -
  7. 7. Character Classes  A character class in regex starts with an opening square bracket ([) and ends with a closing square bracket (]).  A character class matches a single character in the subject; the character must be in the set of characters defined by the class.  Example:  [a-z] will match any lowercase letter  [^A-Z] will match any character that is not an uppercase letter
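A short sketch of the two classes above, assuming Python's `re` module as a stand-in for PHP's PCRE functions; the sample string "Regex 101" is invented for illustration.

```python
import re

lowercase = re.compile(r"[a-z]")       # one lowercase ASCII letter
not_uppercase = re.compile(r"[^A-Z]")  # one character that is not an uppercase ASCII letter

sample = "Regex 101"

# Each match is a single character, because a class matches exactly one character.
lower_hits = lowercase.findall(sample)
non_upper_hits = not_uppercase.findall(sample)

print(lower_hits)      # ['e', 'g', 'e', 'x']
print(non_upper_hits)  # ['e', 'g', 'e', 'x', ' ', '1', '0', '1']
```

Note that the negated class also matches the space and the digits: it excludes only the uppercase letters.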
  8. 8. Subpatterns  Subpatterns are delimited by parentheses (round brackets), which can be nested.  Marking part of a pattern as a subpattern does two things: 1. It localizes a set of alternatives. For example, the pattern hen(dy|rio|ri) matches one of the words "hendy", "henrio", or "henri". Without the parentheses, it would match "hendy", "rio" or "ri". 2. It sets up the subpattern as a capturing subpattern: the part of the subject that matched the subpattern is returned to the caller along with the whole match.
  9. 9. Subpatterns Cont’d  For example, if the string "kafji tinggi" is matched against the pattern ((kafji|niko) (tinggi|tampan)), the captured substrings are "kafji tinggi", "kafji", and "tinggi", and they are numbered 1, 2, and 3. Groups are numbered by the position of their opening parenthesis, from left to right.  Often we need the grouping but not the capturing. In that case we can add "?:" right after the opening parenthesis.
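The grouping and capturing behavior from the two subpattern slides can be checked with a small sketch. It assumes Python's `re` module as a stand-in for PHP's PCRE functions; the variable names are illustrative only.

```python
import re

# 1. Parentheses localize the alternatives.
grouped = re.compile(r"hen(dy|rio|ri)")
ungrouped = re.compile(r"hendy|rio|ri")

print(bool(grouped.fullmatch("henri")))  # True
print(bool(grouped.fullmatch("ri")))     # False: "hen" is required
print(bool(ungrouped.fullmatch("ri")))   # True: "ri" is a whole alternative

# 2. Parentheses capture; groups are numbered by their opening parenthesis.
nested = re.compile(r"((kafji|niko) (tinggi|tampan))")
captured = nested.search("kafji tinggi").groups()
print(captured)  # ('kafji tinggi', 'kafji', 'tinggi')

# "?:" keeps the grouping but drops that group from the captures.
non_capturing = re.compile(r"((?:kafji|niko) (tinggi|tampan))")
captured_nc = non_capturing.search("kafji tinggi").groups()
print(captured_nc)  # ('kafji tinggi', 'tinggi')
```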
  10. 10. Optional Items  The question mark makes the preceding token in the regular expression optional.  Example: colou?r will match both colour and color.  You can also wrap a set of characters in parentheses to make them optional as a unit.  Example: Jan(uary)? will match both Jan and January.
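A quick sketch of both optional-item examples, assuming Python's `re` module as a stand-in for PHP's PCRE functions; the test strings are made up for illustration.

```python
import re

colour = re.compile(r"colou?r")    # the "u" is optional
month = re.compile(r"Jan(uary)?")  # the whole group "uary" is optional

colour_hits = colour.findall("color colour colouur")
print(colour_hits)  # ['color', 'colour'] ("colouur" has one "u" too many)

print(bool(month.fullmatch("Jan")))      # True
print(bool(month.fullmatch("January")))  # True
print(bool(month.fullmatch("Janu")))     # False: the group is all or nothing
```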
  11. 11. Repetition  There are two repetition characters, star ( * ) and plus ( + ).  The star ( * ) tries to match the preceding token zero or more times.  The plus ( + ) tries to match the preceding token one or more times.  Example:  [\s\S]+ will match one or more of any character, including newlines  [\s\S]* will match zero or more of any character, including newlines
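The difference between star and plus shows up on input where the repeated token is absent. The sketch below assumes Python's `re` module as a stand-in for PHP's PCRE functions; the `ab*c` and `ab+c` patterns are extra illustrations that are not on the slide.

```python
import re

star = re.compile(r"ab*c")  # zero or more "b"
plus = re.compile(r"ab+c")  # one or more "b"

print(bool(star.fullmatch("ac")))     # True: zero "b" is allowed
print(bool(plus.fullmatch("ac")))     # False: at least one "b" is required
print(bool(plus.fullmatch("abbbc")))  # True

# [\s\S] is "whitespace or non-whitespace", so it matches any character, newlines included.
any_plus = re.compile(r"[\s\S]+")
any_star = re.compile(r"[\s\S]*")

print(bool(any_plus.fullmatch("line one\nline two")))  # True
print(bool(any_plus.fullmatch("")))                    # False: needs one character
print(bool(any_star.fullmatch("")))                    # True: zero characters is fine
```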
  12. 12. Limiting Repetition  Sometimes we need to limit a repetition. To achieve that we can use curly brackets { }.  The syntax is {min,max}. min is required. If you leave max empty, as in {min,}, there is no upper limit. If you omit both the comma and max, as in {min}, the token is repeated exactly min times.  Example:  ([A-Z]{3}|[0-9]{4}) will match three uppercase letters or four digits
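The three forms of the curly-bracket syntax can be tried out as below. This is a sketch assuming Python's `re` module as a stand-in for PHP's PCRE functions; the `a{2,}` and `a{2,3}` patterns are added purely for illustration.

```python
import re

# Exactly three uppercase letters, or exactly four digits.
code = re.compile(r"([A-Z]{3}|[0-9]{4})")

print(bool(code.fullmatch("ABC")))    # True
print(bool(code.fullmatch("1234")))   # True
print(bool(code.fullmatch("AB")))     # False: too few letters
print(bool(code.fullmatch("12345")))  # False: too many digits for a full match

at_least_two = re.compile(r"a{2,}")   # min only: no upper limit
two_to_three = re.compile(r"a{2,3}")  # min and max

print(bool(at_least_two.fullmatch("aaaaa")))  # True
print(bool(two_to_three.fullmatch("aaaa")))   # False: four exceeds the max of three
```

`fullmatch` is used so the whole string has to fit the pattern; with a plain search, "12345" would still contain a four-digit match.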
  13. 13. Greediness  Greediness describes what the regex engine does when a token, such as an optional item or a repetition, gives it the choice to match or not to match.  By default the engine always tries to match first, and gives text back only if the rest of the pattern then fails. This can cause trouble and return an unexpected result.  For example, when the regex Feb 23(rd)? is applied to the string Today is Feb 23rd, 2003, the match will always be Feb 23rd and not Feb 23.
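The Feb 23 example can be reproduced directly. The sketch assumes Python's `re` module as a stand-in for PHP's PCRE functions; the second input string is an invented variation.

```python
import re

pattern = re.compile(r"Feb 23(rd)?")

# The optional group is greedy: it is taken whenever it can be.
with_suffix = pattern.search("Today is Feb 23rd, 2003").group(0)
print(with_suffix)  # Feb 23rd

# The group is skipped only when "rd" is not there to match.
without_suffix = pattern.search("Today is Feb 23, 2003").group(0)
print(without_suffix)  # Feb 23
```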
  14. 14. Greediness Cont’d  Example for repetition:  Suppose you want to extract HTML tags while crawling a website. Beginners will usually use <.+> to match an HTML tag, but it returns a different result than expected. Let’s try to match that pattern against this string: "Saya <b>suka</b> makan"  The result will be <b>suka</b>  Why?
  15. 15. Greediness Cont’d  That is because of greediness: in the pattern <.+> the dot ( . ) is repeated as many times as possible.  Let’s go through it step by step.  First the regex engine searches for < in the string "Saya <b>suka</b> makan", so "Saya " is skipped.  Then, after finding <, it runs .+ , which matches one or more of any character, so it reads from b up to the end of the string. It then backtracks, giving up one character at a time, until the remaining > can match. That first happens at the last > in the string, so the result is <b>suka</b> and not <b> and </b>.
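The greedy result described above can be confirmed with a short sketch, assuming Python's `re` module as a stand-in for PHP's PCRE functions.

```python
import re

text = "Saya <b>suka</b> makan"
greedy_tag = re.compile(r"<.+>")

match = greedy_tag.search(text)
print(match.start())   # 5: the leading "Saya " is skipped
print(match.group(0))  # <b>suka</b>

# Only one match is found, because the greedy .+ swallowed both tags at once.
all_matches = greedy_tag.findall(text)
print(all_matches)  # ['<b>suka</b>']
```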
  16. 16. Laziness  How do we fix the greediness problem? Make the token lazy by adding a question mark (?) after the repetition operator (*, + or {min,max}) or after the optional-item question mark. A lazy token matches as little as possible.  There is also an alternative to laziness: a negated character class.  Example for the previous question:  <[^>]+> matches a <, then one or more characters that are not >, then the closing >
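Both fixes can be compared on the earlier strings. The sketch assumes Python's `re` module as a stand-in for PHP's PCRE functions; the lazy patterns `<.+?>` and `Feb 23(rd)??` are built from the rule on this slide and are not quoted from it.

```python
import re

text = "Saya <b>suka</b> makan"

# Lazy repetition: +? stops at the first > it can reach.
lazy_tags = re.findall(r"<.+?>", text)
print(lazy_tags)  # ['<b>', '</b>']

# Negated character class: the repeated token can never run past a >.
negated_tags = re.findall(r"<[^>]+>", text)
print(negated_tags)  # ['<b>', '</b>']

# A lazy optional item: ?? prefers to skip the group.
lazy_date = re.search(r"Feb 23(rd)??", "Today is Feb 23rd, 2003").group(0)
print(lazy_date)  # Feb 23
```

The negated class is often the tighter choice, since it cannot overshoot and so needs no backtracking to find the closing bracket.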