Web Crawlers in Perl

Use Perl to write an automated script that peruses web pages, sucking in data and processing it into more useful data.

Published in: Data & Analytics

  1. Web Crawlers in Perl
      Presented by Lambert Lum
  2. Terminology
      Web crawling: larger scale
      Screen scraping: smaller scale
      Here we'll use both terms interchangeably
  3. What we don't cover
      No database
      No data warehousing
      No parallel processing
      No asynchronous coding
  4. Useful for you?
      Just for Google wannabes?
      Just for search engine engineers?
  5. cpanminus
      Uses screen scraping to get data from cpan.org
  6. WWW::Mechanize
      Inherits from LWP::UserAgent
      Ported to other languages
      Strangely missing in PHP
  7. WWW::Mechanize (basic)
      use WWW::Mechanize;
      # Fetch one page and print its raw HTML
      my $mech = WWW::Mechanize->new();
      $mech->get("http://sfbay.craigslist.org/ela");
      my $content = $mech->content;
      print $content;
  8. WWW::Mechanize (regex)
      use WWW::Mechanize;
      my $mech = WWW::Mechanize->new();
      $mech->get("http://sfbay.craigslist.org/ela");
      my $content = $mech->content;
      # Capture the text of the first <h4> element (non-greedy match)
      my ($h4_text) = $content =~ m{<h4.*?>(.*?)</h4>};
      print "$h4_text\n";
  9. HTML::TreeBuilder
      use WWW::Mechanize;
      use HTML::TreeBuilder;
      my $mech = WWW::Mechanize->new();
      $mech->get("http://sfbay.craigslist.org/ela");
      my $content = $mech->content;
      # Parse the fetched HTML into a tree that can be searched
      my $tree = HTML::TreeBuilder->new();
      $tree->parse_content($content);
  10. HTML::TreeBuilder
      # Find the first <h4 class="ban"> element in the parsed tree
      my $elt = $tree->look_down(
          _tag  => 'h4',
          class => 'ban',
      );
      print "h4: " . $elt->as_text . "\n";
  11. HTML::TreeBuilder
      [use Firebug]
  12. HTML::TreeBuilder alternatives
      Web::Scraper
      HTML::TreeBuilder::XPath
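  13. Cached Mechanize
      # read_file/write_file as exported by File::Slurp (assumed); make_path is from File::Path
      use File::Slurp;
      use File::Path qw(make_path);
      my $dir = "data/$ela";
      my $content;
      # Try the on-disk copy first; eval swallows the error if it does not exist yet
      eval { $content = read_file("$dir/index.html"); };
      if (!$content) {
          my $url = "http://$ela";
          $mech->get($url);
          $content = $mech->content;
          make_path($dir);
          write_file("$dir/index.html", $content);
          print "wrote $dir/index.html\n";
      }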
  14. Other caching
      WWW::Mechanize::Plugin::Cache
      WWW::Mechanize::Cached
  15. Form submission
      my $mech = WWW::Mechanize->new();
      $mech->get("http://sfbay.craigslist.org/ela/");
      # Fill in the search form fields, then submit it
      $mech->field('catAbb', 'ela');
      $mech->field('query', 'playstation');
      $mech->field('maxAsk', 300);
      $mech->submit();
      # Collect the result links whose text mentions "playstation"
      my @links = $mech->find_all_links(
          text_regex => qr{playstation}i,
      );
      print join "", map { $_->text() . "\n" } @links;
  16. Form submission (2)
      my $mech = WWW::Mechanize->new();
      $mech->get("http://sfbay.craigslist.org/ela/");
      # Inspect the first form on the page: its action URL, inputs, and field names
      my @forms = $mech->forms;
      my $form = $forms[0];
      my $action = $form->action;
      my @inputs = $form->inputs;
      my @names = $form->param;
  17. Follow Next Link
      my $mech = WWW::Mechanize->new();
      my $url = "http://sfbay.craigslist.org/ela";
      $mech->get($url);
      my $uri = $mech->uri;
      print "uri: $uri\n";
      # Follow the "next >" link for at most 10 pages
      my $i = 0;
      while ($i < 10 && $mech->follow_link(text => 'next >')) {
          $uri = $mech->uri;
          print "uri: $uri\n";
          $i++;
      }
  18. Other uses
      Test::WWW::Mechanize
  19. Other uses
      Link checking
  20. Legality
      I'm not a lawyer
      User agreements may prohibit screen scraping
      eBay has sued a notorious screen scraper
      Online games will almost always ban you
  21. No DDoS
      Be considerate. Don't hit the server as if you were a DDoS attack.
  22. JavaScript parsing
      HTML parsers are easy. JavaScript parsers are hard.
  23. JavaScript parsing
      Selenium lets you hijack your Firefox web browser
      Headless WebKit (PhantomJS/Wight): WebKit is the base for Chrome and Safari
  24. Homework
      Crawl every page of modernperlbooks.com, extracting the title and first paragraph of each
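
The homework on slide 24 needs a title and a first paragraph from each fetched page. Below is a minimal sketch of just that extraction step. It uses only core Perl and a regex in the spirit of slide 8, not HTML::TreeBuilder, so it will be fooled by unusual markup. The helper name `extract_title_and_first_para` and the sample HTML are made up for illustration. In a real crawl the string would come from `$mech->content`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): return the <title> text and the
# text of the first <p> element found in an HTML string.
sub extract_title_and_first_para {
    my ($html) = @_;

    # Non-greedy captures; /s lets "." span newlines, /i ignores tag case.
    # \b keeps "<p" from matching "<pre>".
    my ($title) = $html =~ m{<title\b[^>]*>(.*?)</title>}si;
    my ($para)  = $html =~ m{<p\b[^>]*>(.*?)</p>}si;

    # The loop variable aliases $title and $para, so edits apply to them.
    for my $text ($title, $para) {
        next unless defined $text;
        $text =~ s/<[^>]+>//g;      # drop nested tags such as <em>
        $text =~ s/\s+/ /g;         # collapse runs of whitespace
        $text =~ s/^\s+|\s+$//g;    # trim both ends
    }
    return ($title, $para);
}

# Made-up sample page, standing in for $mech->content.
my $sample = <<'HTML';
<html><head><title> Example   Page </title></head>
<body>
<pre>not a paragraph</pre>
<p class="intro">First <em>paragraph</em>
of the page.</p>
<p>Second paragraph.</p>
</body></html>
HTML

my ($title, $para) = extract_title_and_first_para($sample);
print "title: $title\n";    # title: Example Page
print "para:  $para\n";     # para:  First paragraph of the page.
```

Either return value is undef when the page has no matching element, so callers should check before printing.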

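Slides 13 and 21 combine naturally: serve a page from disk when it is already cached, and pause after every real request so the crawl never resembles a DDoS. The sketch below assumes a flat cache directory and a fixed delay. The name `cached_fetch`, its parameters, and the fake fetcher are all hypothetical. The fetcher is passed in as a code reference so the example runs without network access.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper (not from the slides): return the cached copy of a URL
# if one exists; otherwise call $fetch->($url), save the result, and sleep.
sub cached_fetch {
    my (%arg) = @_;
    my ($url, $dir, $fetch) = @arg{qw(url cache_dir fetch)};
    my $delay = $arg{delay} // 1;    # seconds to wait after a real fetch

    # Assumption: one flat directory, with the URL squashed into a file name.
    (my $name = $url) =~ s{[^A-Za-z0-9._-]+}{_}g;
    my $file = File::Spec->catfile($dir, $name);

    if (-e $file) {
        open(my $in, '<', $file) or die "cannot read $file: $!";
        local $/;                    # slurp the whole file at once
        my $cached = <$in>;
        close $in;
        return $cached;
    }

    my $content = $fetch->($url);
    make_path($dir);
    open(my $out, '>', $file) or die "cannot write $file: $!";
    print {$out} $content;
    close $out;

    sleep $delay if $delay > 0;      # be considerate to the server
    return $content;
}

# Usage with a fake fetcher that counts how often it is really called.
my $calls      = 0;
my $fake_fetch = sub { $calls++; return "<html>page for $_[0]</html>" };
my $cache_dir  = tempdir(CLEANUP => 1);

for (1 .. 3) {
    cached_fetch(
        url       => 'http://example.com/index.html',
        cache_dir => $cache_dir,
        fetch     => $fake_fetch,
        delay     => 0,              # no pause needed for a fake fetcher
    );
}
print "fetcher called $calls time(s)\n";    # fetcher called 1 time(s)
```

With WWW::Mechanize, the fetcher could be `sub { $mech->get($_[0]); $mech->content }`, leaving the default one-second delay in place.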