Professional Web Scraping
Yung-chung Lin
henearkrxern@gmail.com

Professional Site Scraping

  1. Professional Web Scraping
     Yung-chung Lin
     henearkrxern@gmail.com
  2. Common problems
     • Regular expressions break when the page layout changes
     • Too many output formats
     • Writing a site scraper takes too much time
  3. Web scraping process
  4. Need a better framework?
  5. F-E-P
  6. Fetching-Extracting-Processing
  7. Benefits
     • Tasks are split into three phases: fetching, extracting, and processing.
     • No more messy all-in-one scripts
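
The three-phase split can be sketched in plain Perl. This is only an illustration of the idea, not the framework presented in the slides: the function names `fetch_page`, `extract_items`, and `process_items` are hypothetical, and the fetch step returns canned HTML instead of making an HTTP request.

```perl
use strict;
use warnings;

# Phase 1: fetching. Hypothetical stand-in that returns canned HTML
# instead of performing a real HTTP request.
sub fetch_page {
    my ($url) = @_;
    return qq{<html><title>Example</title><a href="/a">A</a><a href="/b">B</a></html>};
}

# Phase 2: extracting. Pulls the title and the link targets out of the HTML.
# Regexes are used only to keep the sketch dependency-free; the slides
# argue for selector-based extraction instead.
sub extract_items {
    my ($html) = @_;
    my ($title) = $html =~ m{<title>(.*?)</title>}s;
    my @links   = $html =~ m{<a\s+href="([^"]*)"}g;
    return { title => $title, links => \@links };
}

# Phase 3: processing. Turns the extracted record into tab-separated lines.
sub process_items {
    my ($record) = @_;
    return map { "$record->{title}\t$_" } @{ $record->{links} };
}

my @lines = process_items( extract_items( fetch_page('http://example.com/') ) );
print "$_\n" for @lines;
```

Because each phase only sees the previous phase's return value, any one of them can be replaced (for example, a different output format in `process_items`) without touching the other two.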
  8. A sample scraper
  9. Fetching Extracting Processing Fetching
  10. 34 lines and it works!
  11. Fetching
      # based on WWW::Mechanize
      my $m = $self->{m};

      $m->get($url);
      $m->find_link();
      $m->links();
  12. But much more ...
      # is the page cached on disk?
      $m->is_cached();

      # is the link visited?
      $m->is_visited();

      # selects elements from a web page
      $m->query_first('title')->as_text;
      $m->query_first('div#summary')->innerHTML;
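
The slides do not show how `is_cached` and `is_visited` work internally. As a rough sketch of one way to get similar behaviour with core modules only, the code below keeps a visited set in memory and stores page bodies on disk under an MD5 digest of the URL. The helper names `cache_path` and `fetch_cached`, the file layout, and the stub fetcher are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;
use File::Temp qw(tempdir);

my $cache_dir = tempdir( CLEANUP => 1 );   # throwaway cache directory
my %visited;                               # URLs requested in this run

# Map a URL to a file name inside the cache directory.
sub cache_path {
    my ($url) = @_;
    return File::Spec->catfile( $cache_dir, md5_hex($url) . '.html' );
}

sub is_cached {
    my ($url) = @_;
    my $path = cache_path($url);
    return -e $path ? 1 : 0;
}

sub is_visited {
    my ($url) = @_;
    return $visited{$url} ? 1 : 0;
}

# Return the page body, calling $fetcher->($url) only on a cache miss.
sub fetch_cached {
    my ( $url, $fetcher ) = @_;
    $visited{$url} = 1;
    my $path = cache_path($url);
    if ( -e $path ) {
        open my $in, '<', $path or die "Cannot read $path: $!";
        local $/;                          # slurp the whole file
        return scalar <$in>;
    }
    my $body = $fetcher->($url);
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $body;
    close $out or die "Cannot close $path: $!";
    return $body;
}

my $calls   = 0;
my $fetcher = sub { $calls++; return '<title>stub page</title>' };   # no network
my $first   = fetch_cached( 'http://example.com/', $fetcher );
my $second  = fetch_cached( 'http://example.com/', $fetcher );
print "fetcher called $calls time(s)\n";
```

Passing the fetcher in as a code reference keeps the cache logic independent of the HTTP client, so a `WWW::Mechanize`-based fetcher could be substituted for the stub.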
  13. Extracting syntax
      # very much like jQuery syntax
      query('div.class'), query('div.id'), query('title')

      # selecting siblings and parents
      query('div:right'), query('div:left'), query('div:upper')

      # selecting elements with some words
      query('div:contains(words)')
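
To make the selector forms above more concrete, here is a small sketch that splits such a string into its parts. It is not how the framework's `query` is implemented (the slides do not say); `parse_selector` is a hypothetical helper that handles only a single tag with an optional `#id`, `.class`, and `:pseudo(argument)` suffix.

```perl
use 5.010;    # named captures and %+ need Perl 5.10 or later
use strict;
use warnings;

# Split a selector such as 'div#summary' or 'div:contains(words)' into parts.
# Hypothetical helper covering only the simple forms shown on the slide.
sub parse_selector {
    my ($selector) = @_;
    $selector =~ m{
        \A
        (?<tag> [A-Za-z][A-Za-z0-9]* )          # element name
        (?: \# (?<id> [\w-]+ ) )?               # optional id part
        (?: \. (?<class> [\w-]+ ) )?            # optional class part
        (?: : (?<pseudo> [\w-]+ )               # optional pseudo-selector
            (?: \( (?<arg> [^)]* ) \) )? )?     # optional argument in parentheses
        \z
    }x or return;
    my %parts = %+;    # copy only the named captures that matched
    return \%parts;
}

for my $s ( 'div.class', 'div#summary', 'title', 'div:contains(words)' ) {
    my $p = parse_selector($s) or next;
    print "$s => ", join( ', ', map { "$_=$p->{$_}" } sort keys %$p ), "\n";
}
```

A parsed selector like this would still need a DOM tree to match against; the sketch stops at the parsing step.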
  14. Data processing
      # outputs SQL statements
      $self->handle_one_item($r,
             { name => 'D::LinebylineSQL',
               new => { fields => [ qw(ID URL Summary) ] } });

      # outputs CSV files
      $self->handle_one_item($r,
             { name => 'D::CSV',
               new => { fields => [ qw(ID URL Summary) ] } });
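
The idea of switching output formats by name can be illustrated with a small dispatch table. This is a sketch under stated assumptions, not the `D::LinebylineSQL` or `D::CSV` handlers from the slide: `to_csv`, `to_sql`, the `%formatters` table, and the table name `items` are all hypothetical.

```perl
use strict;
use warnings;

my @fields = qw(ID URL Summary);

# Render one record as a CSV line, quoting every value and doubling
# embedded double quotes.
sub to_csv {
    my ($record) = @_;
    return join ',', map {
        my $v = defined $record->{$_} ? $record->{$_} : '';
        $v =~ s/"/""/g;
        '"' . $v . '"';
    } @fields;
}

# Render one record as an INSERT statement. Single quotes are doubled,
# the standard SQL string-literal escape; code that talks to a real
# database should prefer DBI placeholders over string building.
sub to_sql {
    my ( $table, $record ) = @_;
    my @values = map {
        my $v = defined $record->{$_} ? $record->{$_} : '';
        $v =~ s/'/''/g;
        "'" . $v . "'";
    } @fields;
    return sprintf 'INSERT INTO %s (%s) VALUES (%s);',
        $table, join( ', ', @fields ), join( ', ', @values );
}

# Output handlers selected by name; 'items' is an assumed table name.
my %formatters = (
    csv => \&to_csv,
    sql => sub { to_sql( 'items', @_ ) },
);

my $record = { ID => 1, URL => 'http://example.com/', Summary => q{It's "fine"} };
print $formatters{$_}->($record), "\n" for sort keys %formatters;
```

Adding another output format then means adding one more entry to `%formatters`, while the fetching and extracting code stays unchanged.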
  15. Other features
      • Document caching
        • Politeness to websites
      • Logging
        • Every action is recorded
      • Scraping history
        • No duplicates
      • And much more ...
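
The slide lists these features without describing how they are implemented. One common way to combine politeness, logging, and duplicate suppression is sketched below; the fixed per-host delay, the `schedule` and `host_of` helpers, and the log message format are all assumptions. The clock is passed in as an argument so the example runs without sleeping.

```perl
use strict;
use warnings;

my $min_delay = 2;    # assumed minimum seconds between requests to one host
my %last_hit;         # host => time of the most recent scheduled request
my %seen;             # URLs already scheduled (the scraping history)
my @log;              # one entry per action

# Extract the host part of an http(s) URL; returns undef for anything else.
sub host_of {
    my ($url) = @_;
    my ($host) = $url =~ m{\Ahttps?://([^/:]+)}i;
    return defined $host ? lc($host) : undef;
}

# Decide what to do with a URL at time $now: skip duplicates, otherwise
# return how many seconds to wait before the request is polite.
sub schedule {
    my ( $url, $now ) = @_;
    if ( $seen{$url}++ ) {
        push @log, "skip duplicate $url";
        return;
    }
    my $host = host_of($url);
    my $wait = 0;
    if ( defined $host && exists $last_hit{$host} ) {
        my $elapsed = $now - $last_hit{$host};
        $wait = $min_delay - $elapsed if $elapsed < $min_delay;
    }
    $last_hit{$host} = $now + $wait if defined $host;
    push @log, "fetch $url after waiting ${wait}s";
    return $wait;
}

my @waits;
for my $url (
    qw(http://example.com/a http://example.com/b http://example.com/a http://example.org/)
  )
{
    my $wait = schedule( $url, 100 );    # every URL arrives at time 100
    push @waits, defined $wait ? $wait : 'skip';
}
print "$_\n" for @log;
```

A real scraper would call `sleep` for the returned number of seconds before fetching; here the wait is only computed and logged.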
  16. Thank you