Regular Expressions and the World Wide Web

Importance of regular expressions on the web

Published in: Technology

  1. 1. REGULAR EXPRESSIONS, EXTRAORDINARY POWER. UNSL 2013. Burdisso Sergio - sergio.burdisso@gmail.com
  2. 2. I have 20 minutes to cover everything about using REs on the W3
  3. 3. HTTP; Internet bots; Web crawlers; Web scraping
  4. 4. HyperText Transfer Protocol
  5. 5. [Diagram: Web Browser and WWW (The Web), exchanging a Request and a Response, both over HTTP]
  6. 6. Application layer protocol. HTTP is the protocol to exchange or transfer hypertext (sequences of characters). HTTP documentation: http://www.w3.org/Protocols/rfc2616/rfc2616.html
  7. 7. HTTP Response example: Header, Body
  8. 8. EXTRAORDINARY POWER
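  9. 9. First Things First… Regular Expressions Are Awesome! Gather text; Replace / Transform text; Search / Validate text
  10. 10. POSIX regular expressions (standard)
      Metacharacters: ^ . [ ] [^ ] ( ) * {m,n} ? + | $
      #include <regex.h>
      pattern = "([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})" // POSIX has no \d; use [0-9]
      regcomp(regex_t *regex, pattern, cflags); // cflags must include REG_EXTENDED for ( ) and {m,n}
      regex.re_nsub == 4 // number of parenthesized subexpressions
      regexec(regex, text, nmatch, pmatch[], eflags);
      pmatch[i].rm_so, pmatch[i].rm_eo // start and end offsets of subexpression i; each octet must still be checked to be <= 255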
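The same idea can be sketched in Python, assuming the slide's goal is to recognize a dotted-quad IPv4 address: the regular expression checks only the shape, and the range check (each octet <= 255) is done on the captured groups. The function name `parse_ipv4` is hypothetical and not part of the original slides.

```python
import re

# Same shape as the slide's pattern: four groups of 1-3 digits separated by literal dots.
# re.ASCII restricts \d to 0-9, like [0-9] in the POSIX version.
IPV4_SHAPE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", re.ASCII)


def parse_ipv4(text):
    """Return the four octets as ints, or None if text is not a valid dotted quad."""
    match = IPV4_SHAPE.fullmatch(text)
    if match is None:
        return None
    octets = [int(group) for group in match.groups()]
    # The regex only validates the shape; the numeric range is checked separately.
    if any(octet > 255 for octet in octets):
        return None
    return octets
```

Keeping the range check outside the pattern keeps the expression short; encoding "0 to 255" in the regex itself is possible but much harder to read.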
  11. 11. Making use of REs to parse HTTP response headers
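A minimal sketch of what such parsing might look like in Python, assuming an HTTP/1.x response already read into a string. The patterns and the `parse_response` name are illustrative assumptions, not the code shown in the talk, and the sketch ignores details such as repeated or folded header fields.

```python
import re

# Status line, e.g. "HTTP/1.1 200 OK": version digits, 3-digit code, optional reason phrase.
STATUS_LINE = re.compile(r"HTTP/(\d)\.(\d) (\d{3}) ?(.*)")
# Header field, e.g. "Content-Type: text/html": name, colon, value without outer whitespace.
HEADER_FIELD = re.compile(r"([^:\s]+):[ \t]*(.*?)[ \t]*")


def parse_response(raw):
    """Split a raw response into (status code, reason, headers dict, body)."""
    # Header and body are separated by the first empty line.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    status = STATUS_LINE.fullmatch(lines[0])
    if status is None:
        raise ValueError("malformed status line")
    headers = {}
    for line in lines[1:]:
        field = HEADER_FIELD.fullmatch(line)
        if field:
            # Field names are case-insensitive, so store them lowercased.
            headers[field.group(1).lower()] = field.group(2)
    return int(status.group(3)), status.group(4), headers, body
```

With the headers in a dictionary, the body can then be handled according to fields such as `content-type` or `content-length`.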
  12. 12. Great! Now we're able to parse the HTTP response headers… so what? - We can properly process the response body! Ah, I see! … and what would I do that for? - Let me show you!
  13. 13. Just like spiders on the web!
  14. 14. Regular Expressions cartoon from xkcd. Web Scraping (we will see!)
  15. 15. Internet bots (web robots, WWW robots or bots) are software applications that run automated tasks over the Internet. A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Web scraping is a computer software technique of extracting information from websites.
  16. 16. A Web crawler starts with a list of URLs to visit. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit.
      hyperlinks0 = getAllLexemes(rsp.Body, "href=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
      hyperlinks1 = getAllLexemes(rsp.Body, "src=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
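The crawl loop described on this slide can be sketched in Python as follows. The pattern is a slightly adapted merge of the slide's two expressions (it also accepts `https:` and stops the host at a quote), and `extract_links`, `crawl`, the `fetch` callable and `max_pages` are hypothetical names introduced for illustration; `fetch` is passed in so the sketch does not assume any particular HTTP client.

```python
import re
from collections import deque
from urllib.parse import urljoin

# href="..." or src="...": optional scheme and host (group 1), then the path (group 2).
LINK = re.compile(r'(?:href|src)="((?:https?:)?//[^/\r\n"]*)?(/?[^"]*)"')


def extract_links(body, base_url):
    """Return every href/src target in body as an absolute URL."""
    links = []
    for match in LINK.finditer(body):
        origin, path = match.group(1) or "", match.group(2)
        # urljoin resolves relative and scheme-relative links against the current page.
        links.append(urljoin(base_url, origin + path))
    return links


def crawl(seed_urls, fetch, max_pages=10):
    """Visit URLs breadth-first, queuing each newly discovered link once."""
    frontier = deque(seed_urls)   # the list of URLs still to visit
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in extract_links(fetch(url), url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

A real crawler would also need to respect robots.txt, limit its request rate and handle fetch errors, none of which is shown here.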
  17. 17. Web scraping: a simple yet powerful approach to extract information from web pages can be based on the regular expression matching facilities of programming languages (for instance C++, Perl or Python).
  18. 18. Regular Expressions cartoon from xkcd.
      WebScraping wScraping(8, "http://emails.com/victim");
      wScraping.findAll("^(?n:(?<address1>(\d{1,5}( 1/[234])?(\x20[A-Z]([a-z])+)+ )|(P\.O\. Box\d{1,5}))\s{1,2}(?i:(?<address2>(((APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)\x20\w{1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)\.?)\s{1,2})?)(?<city>[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2}),\x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\x20(?<zipcode>(?!0{5})\d{5}(-\d{4})?))$");
      We've saved the day!
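The named groups in the expression above are what make the scraped result usable. A much reduced sketch in Python, assuming only the city, state and ZIP part of a US address is wanted: Python spells named groups `(?P<name>...)` rather than `(?<name>...)`, the state is loosened to any two capital letters, and `find_all` is a hypothetical stand-in for the slide's `findAll`.

```python
import re

# City (one to three capitalized words), two-letter state, ZIP or ZIP+4.
# As in the slide's pattern, (?!0{5}) rejects the all-zero ZIP code.
CITY_STATE_ZIP = re.compile(
    r"(?P<city>[A-Z][a-z]+(?: [A-Z][a-z]+){0,2}), "
    r"(?P<state>[A-Z]{2}) "
    r"(?P<zipcode>(?!0{5})\d{5}(?:-\d{4})?)\b"
)


def find_all(text):
    """Return one dict of named groups per address fragment found in text."""
    return [match.groupdict() for match in CITY_STATE_ZIP.finditer(text)]
```

Each match comes back as a dictionary keyed by group name, so later code can read `result["zipcode"]` instead of counting parentheses.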
  19. 19. Everybody stand back! We know regular expressions. The end. Thank you for your patience!