Regular Expressions and the World Wide Web


Importance of regular expressions on the web

Transcript

  • 1. REGULAR EXPRESSIONS, EXTRAORDINARY POWER. UNSL 2013. Sergio Burdisso - sergio.burdisso@gmail.com
  • 2. I have 20 min to cover everything about using REs on the W3
  • 3. HTTP, Internet bots, Web Crawlers, Web Scraping
  • 4. HyperText Transfer Protocol
  • 5. WWW (The Web): Web Browser <-> Request / Response over HTTP
  • 6. Application layer protocol. HTTP is the protocol to exchange or transfer hypertext (sequences of characters). HTTP documentation: http://www.w3.org/Protocols/rfc2616/rfc2616.html
  • 7. HTTP Response example: Header + Body
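A minimal response of the shape the slide illustrates (status line and header fields first, then a blank line, then the body; the concrete values here are invented for illustration):

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 55

<html><body><a href="/next">next page</a></body></html>
```

Everything a regular expression might want to find (status code, header values, links in the body) is plain text inside this message.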
  • 8. EXTRAORDINARY POWER
  • 9. First things first… Regular Expressions Are Awesome! Gather text; Replace / Transform text; Search / Validate text
  • 10. POSIX regular expressions (standard) ▪ ^ . [ ] [^ ] ( ) * {m,n} ? + | $ ▪ regex.h: pattern = "(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})"; regcomp(regex_t *regex, pattern, cflags); regex.re_nsub = 4 // number of parenthesized subexpressions; regexec(regex, text, pmatch[]); pmatch[nsub].rm_so, pmatch[nsub].rm_eo; each captured octet <= 255
  • 11. Making use of REs to parse HTTP response headers
  • 12. Great! Now we're able to parse the HTTP response headers… so what? - We can properly process the response body! Ah, I see! … and what would I do that for? - Let me show you!
  • 13. Just like spiders on the web!
  • 14. Regular Expressions cartoon from xkcd. Web Scraping (we will see!)
  • 15. Internet bots (web robots, WWW robots or bots) are software applications that run automated tasks over the Internet. A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Web scraping is a computer software technique of extracting information from websites.
  • 16. A Web Crawler starts with a list of URLs to visit. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit. hyperlinks0 = getAllLexemes(rsp.Body, "href=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\""); hyperlinks1 = getAllLexemes(rsp.Body, "src=\"((http:)?//([^/\r\n]*))?(/?[^\"]*)\"");
  • 17. Web Scraping: A simple yet powerful approach to extract information from web pages can be based on the regular expression matching facilities of programming languages (for instance C++, Perl or Python)
  • 18. Regular Expressions cartoon from xkcd. Web Scraping: WebScraping wScraping(8, "http://emails.com/victim"); wScraping.findAll("^(?n:(?<address1>(\d{1,5}( 1/[234])?(\x20[A-Z]([a-z])+)+ )|(P\.O\. Box \d{1,5}))\s{1,2}(?i:(?<address2>(((APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)\x20\w{1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)\.?)\s{1,2})?)(?<city>[A-Z]([a-z])+(\.?)(\x20[A-Z]([a-z])+){0,2}),\x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\x20(?<zipcode>(?!0{5})\d{5}(-\d{4})?))$"); We've saved the day!
  • 19. Everybody stand back! We know regular expressions. The end. Thank you for your patience!