Professional Site Scraping
 

    Professional Site Scraping: Presentation Transcript

    • Professional Web Scraping, by Yung-chung Lin (henearkrxern@gmail.com)
    • Common problems
      • Regular expressions fail when the page layout changes
      • Too many output formats
      • Too much time spent writing a site scraper
    • Web scraping process
    • Need a better framework?
    • F-E-P
    • Fetching-Extracting-Processing
    • Benefits
      • Tasks are separated into 3 phases (fetching, extracting, and processing)
      • No more messy all-in-one scripts
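
The three-phase split can be illustrated with a minimal sketch. It is written in Python rather than the deck's Perl framework, whose internals are not shown in the slides; the `Scraper` class, the `PAGES` stand-in for the network, and the three phase functions are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List


@dataclass
class Scraper:
    """Three independent phases; each can be swapped without touching the others."""
    fetch: Callable[[str], str]                # URL -> raw HTML
    extract: Callable[[str], Dict[str, str]]   # raw HTML -> record
    process: Callable[[Dict[str, str]], str]   # record -> output line

    def run(self, urls: Iterable[str]) -> List[str]:
        return [self.process(self.extract(self.fetch(u))) for u in urls]


# Hypothetical stand-in for the network, so the sketch runs offline.
PAGES = {"http://example.com/a": "<html><head><title> Page A </title></head></html>"}


def fetch(url: str) -> str:
    return PAGES[url]


class TitleParser(HTMLParser):
    """Collects the text inside <title> using a real HTML parser."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract(html: str) -> Dict[str, str]:
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip()}


def process(record: Dict[str, str]) -> str:
    return "title=" + record["title"]


scraper = Scraper(fetch, extract, process)
results = scraper.run(["http://example.com/a"])
print(results)
```

Because each phase is just a function, changing the output format or the page parser means replacing one argument and leaving the other two alone.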
    • A sample scraper
    • (Slide labels: Fetching, Extracting, Processing, Fetching)
    • 34 lines and it works!
    • Fetching
      # based on WWW::Mechanize
      my $m = $self->{m};
      $m->get($url);
      $m->find_link();
      $m->links();
    • But much more ...
      # is the page cached on disk?
      $m->is_cached();
      # is the link visited?
      $m->is_visited();
      # selects elements from a web page
      $m->query_first('title')->as_text;
      $m->query_first('div#summary')->innerHTML;
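
How a fetcher might track cached and visited pages can be sketched as follows. This is a Python illustration under assumptions, not the framework's implementation: `CachingFetcher`, its constructor arguments, and the choice of a SHA-256 file name per URL are hypothetical, and unlike the slide's argument-less `is_cached()` and `is_visited()`, these methods take the URL explicitly.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List, Set


class CachingFetcher:
    """Wraps a download function with an on-disk cache and a visited set."""

    def __init__(self, download: Callable[[str], str], cache_dir: str) -> None:
        self.download = download
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.visited: Set[str] = set()

    def _path(self, url: str) -> Path:
        # One cache file per URL, named by the hash of the URL (an assumption).
        return self.cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def is_cached(self, url: str) -> bool:
        return self._path(url).exists()

    def is_visited(self, url: str) -> bool:
        return url in self.visited

    def get(self, url: str) -> str:
        self.visited.add(url)
        path = self._path(url)
        if path.exists():
            # Cache hit: no network request is made.
            return path.read_text(encoding="utf-8")
        body = self.download(url)
        path.write_text(body, encoding="utf-8")
        return body


# Fake download function that records every call, so the demo runs offline.
calls: List[str] = []


def fake_download(url: str) -> str:
    calls.append(url)
    return "<html>hello</html>"


fetcher = CachingFetcher(fake_download, tempfile.mkdtemp())
first = fetcher.get("http://example.com/")
second = fetcher.get("http://example.com/")
print(len(calls), fetcher.is_cached("http://example.com/"))
```

The second `get` is served from disk, so the download function runs only once per URL.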
    • Extracting syntax
      # very much like jQuery syntax
      query('div.class'), query('div#id'), query('title')
      # selecting siblings and parents
      query('div:right'), query('div:left'), query('div:upper')
      # selecting elements that contain some words
      query('div:contains(words)')
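
A selector-based `query` can be sketched in Python with the standard library's ElementTree. This is a hypothetical illustration, not the framework's code: it handles only the tag, class, id, and `:contains(...)` forms, assumes well-formed markup, and does not attempt the framework-specific `:right`, `:left`, or `:upper` selectors, whose semantics the slides do not define.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# tag, tag.class, tag#id, or tag:contains(words)
SELECTOR = re.compile(r"^(\w+)(?:\.([\w-]+)|#([\w-]+)|:contains\((.+)\))?$")


def query(root: ET.Element, selector: str) -> List[ET.Element]:
    """Return descendants of root that match a small jQuery-like selector."""
    match = SELECTOR.match(selector)
    if match is None:
        raise ValueError("unsupported selector: %r" % selector)
    tag, cls, ident, words = match.groups()
    if cls:
        # Simplification: matches only when the class attribute equals cls exactly.
        return root.findall(".//%s[@class='%s']" % (tag, cls))
    if ident:
        return root.findall(".//%s[@id='%s']" % (tag, ident))
    found = root.findall(".//" + tag)
    if words:
        found = [e for e in found if words in "".join(e.itertext())]
    return found


# Well-formed sample page (ElementTree is an XML parser, not an HTML parser).
DOC = ET.fromstring(
    "<html><head><title>Demo</title></head><body>"
    "<div id='summary'>Short summary</div>"
    "<div class='item'>First item</div>"
    "<div class='item'>Second thing</div>"
    "</body></html>"
)

print(query(DOC, "title")[0].text)
print([e.text for e in query(DOC, "div:contains(item)")])
```

Selecting by structure rather than by regular expression over raw text is what lets an extractor of this kind tolerate small layout changes.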
    • Data processing
      # outputs SQL statements
      $self->handle_one_item($r, {
          name => 'D::LinebylineSQL',
          new  => { fields => [ qw(ID URL Summary) ] },
      });
      # outputs CSV files
      $self->handle_one_item($r, {
          name => 'D::CSV',
          new  => { fields => [ qw(ID URL Summary) ] },
      });
    • Other features
      • Document caching
      • Politeness to websites
      • Logging
        • Every action is recorded
      • Scraping history
        • No duplicates
      • And much more ...
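
The idea of routing one record through interchangeable output handlers can be sketched in Python. The functions `to_csv` and `to_sql`, the table name `items`, and the sample row are hypothetical; they are not the `D::CSV` or `D::LinebylineSQL` modules named on the slide. The SQL quoting is deliberately simple and for illustration only; code that talks to a real database should use parameterized queries.

```python
import csv
import io
from typing import Dict, List, Sequence


def to_csv(rows: Sequence[Dict[str, object]], fields: Sequence[str]) -> str:
    """Render rows as CSV text with a header, keeping only the listed fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(fields), extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_sql(rows: Sequence[Dict[str, object]], fields: Sequence[str], table: str) -> List[str]:
    """Render one INSERT statement per row (naive quoting, illustration only)."""

    def quote(value: object) -> str:
        return "'" + str(value).replace("'", "''") + "'"

    columns = ", ".join(fields)
    statements = []
    for row in rows:
        values = ", ".join(quote(row[f]) for f in fields)
        statements.append("INSERT INTO %s (%s) VALUES (%s);" % (table, columns, values))
    return statements


FIELDS = ["ID", "URL", "Summary"]
ROWS = [{"ID": 1, "URL": "http://example.com/a", "Summary": "It's short", "Extra": "dropped"}]

csv_text = to_csv(ROWS, FIELDS)
sql_lines = to_sql(ROWS, FIELDS, "items")
print(csv_text)
print(sql_lines[0])
```

Both handlers take the same rows and field list, so adding another output format does not require changes to the fetching or extracting code.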
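
Politeness, logging, and duplicate suppression can be combined in a few lines. The following Python sketch is an assumption about how such features might fit together, not the framework's behavior: `polite_crawl`, its `delay` parameter, and the injectable `sleep` function are hypothetical names.

```python
import logging
import time
from typing import Callable, Iterable, List, Set

log = logging.getLogger("scraper")


def polite_crawl(
    urls: Iterable[str],
    download: Callable[[str], str],
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each distinct URL once, pausing between requests and logging each action."""
    seen: Set[str] = set()
    bodies: List[str] = []
    for url in urls:
        if url in seen:
            log.info("skip duplicate %s", url)
            continue
        if seen:
            # Pause before every request except the first.
            sleep(delay)
        seen.add(url)
        log.info("fetch %s", url)
        bodies.append(download(url))
    return bodies


# Offline demo: record the pauses instead of actually sleeping.
pauses: List[float] = []
result = polite_crawl(
    ["a", "b", "a", "c"], lambda u: u.upper(), delay=2.0, sleep=pauses.append
)
print(result, pauses)
```

Passing `sleep` as a parameter keeps the delay testable without slowing the test down.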
    • Thank you