Data-Mining the Web Using Perl
  • 1. Data-Mining the Web Using Perl Burt L. Monroe Director, Quantitative Social Science Initiative Department of Political Science The Pennsylvania State University
  • 2. Data-Mining the Web
    • Examples
      • Election Returns in Luxembourg
        • Luxembourg Official Election Results, 2004
        • http://qssi.psu.edu/files/luxembourg.pl
      • Parliamentary Speech
        • The Congressional Record
  • 3. How’d You Do That?
    • There are several programming languages with “straightforward” facilities for doing this. Most notably,
      • Perl
      • Python
      • Java
    • I’m going to talk about Perl, because
      • it’s the most established
      • it’s the one I know
    • It appears that Python may be preferable, but that’s for someone else to say.
  • 4. What’s Perl?
    • Open source (free / flexible / extensible / a little wild and woolly – like Linux, R) programming language.
    • It is very very good at processing text.
      • note, webpages are just texts.
      • note, datasets (like a flat spreadsheet or Stata file) are just texts.
      • Social scientists might have some use for turning one into the other, no?
    • It has very useful facilities for building
      • Spiders
      • Scrapers
      • (and “agents”, “robots”, “crawlers”, etc.)
  • 5. What’s a Spider?
    • A spider is a program designed to automatically gather webpages.
    • If, for example, you want to automatically download all of the speeches delivered in Congress today – without manually clicking on every one, cutting and pasting, etc. – you might want to build a spider.
  • 6. What’s a scraper?
    • A scraper (or “screen-scraper”) extracts the information you want – whatever you consider to be data – from a given webpage.
    • If you want to know who said “health” and how many times, you might want to build a scraper.
  • 7. BEWARE!
    • Spiders (and other similar types of programs – “robots”, “crawlers”) can be put to nefarious use:
      • appropriating copyrighted materials
      • extracting email addresses for spammers
      • overwhelming servers to create “denial of service”
      • generally violating a site’s “terms of service” or “acceptable use policy”
    • If you are not careful to use legal and ethical good practices, you can
      • be denied access to a website altogether
      • get yourself or the university sued or even subjected to criminal penalties
  • 8. Perl
    • Open-source
    • Cross-platform
      • (Windows – I recommend “ActivePerl” from http://www.activestate.com )
    • There are many websites with resources:
      • http://www.cpan.org (Comprehensive Perl Archive Network)
      • http://www.perlmonks.org (PerlMonks)
      • http://www.perl.org
      • http://perl.oreilly.com (O’Reilly Publishing)
    • Lots of mailing lists, etc.
  • 9. Books
    • Basics of Perl
      • The best books are put out by O’Reilly Publishing and are generally known by the animal on the cover.
      • Learning Perl (the Llama)
        • or, Learning Perl on Win32 Systems (the Gecko)
      • Programming Perl (the Camel)
    • Web-mining
      • Perl & LWP (the Blesbok, apparently)
      • Spidering Hacks
    • These books, and some others, are or will be available in the “QuaSSI Library” (in Pond 216).
  • 10. Running Perl
    • For machines with approved ActivePerl installations in Pond ...
      • Perl is located in c:/Perl/
    • For today,
      • we will operate entirely in the directory c:/Perl/eg/
      • To get there,
        • open Programs -> Accessories -> Command Prompt
        • At the prompt, type c:
        • Type cd Perl/eg
    • (On your own installation, on a Mac, or on a Unix-like system such as a high-performance computing cluster, these details will be different.)
  • 11. The First Perl Program
    • Go to the QuaSSI website for the example scripts for today’s workshop:
      • http://qssi.psu.edu/files/howdy.pl
    • Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg
    • Open up the text-editor WinEdt (you could use almost anything) and then open howdy.pl
    • That’s a complete Perl program.
    • Note: that’s all a program is – a text file.
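The transcript does not reproduce howdy.pl itself. A minimal sketch of a program of that kind is shown here; the variable name and the greeting text are assumptions, not the workshop's actual file. It uses two statements so that deleting the semicolon at the end of the first line really does break it (Perl tolerates a missing semicolon only on the last statement).

```perl
use strict;
use warnings;

# Hypothetical stand-in for howdy.pl: the text between the quotation
# marks is what you would edit to make it say something new.
my $greeting = "Howdy, world!\n";
print $greeting;
```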
  • 12. Running a Perl Program
    • Go back to your command prompt.
    • Type perl -w howdy.pl
    • (The -w switch tells perl to give you warnings about what might be wrong if the program is broken. It goes before the script name; anything typed after the script name is passed to the program as an argument instead.)
  • 13. Modifying a program
    • Go back to WinEdt
    • Edit the text between the quotation marks to say something new
    • Click File -> Save
    • Go back to the command prompt
    • Hit the up arrow (to get the last command, perl -w howdy.pl) and press Enter
    • Look at that – you’re a programmer!
  • 14. Break the program
    • Go back to WinEdt
    • Delete the semicolon at the end of the line
    • Save the file
    • Go back to the command prompt and run the program, with -w, again
    • What happened?
  • 15. Perl at 30,000 feet
    • Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com
    • (I’m skipping even more than he did.)
  • 16. Some generalities about Perl
    • Statements in Perl are, or usually can be, constructed in a fairly natural English-like way.
    • There are many ways to do any one thing.
    • The syntax can be off-putting and hard to read, especially at first. It is easy to “obfuscate” Perl code and this is sometimes done intentionally.
    • Main syntax rule: end every statement with ;
  • 17. Data Types
    • Scalars
    • Arrays and Lists
    • Hashes
    • References
    • Filehandles
    • Objects
  • 18. Scalars
    • Numbers
      • Generally decimal floating point
      • (Can be made integer, octal, hexadecimal)
    • Strings
      • Can contain any character
      • Can be null: “”
      • Can be arbitrarily large
  • 19. Strings
    • Single-quoted
      • characters are taken literally, with only two exceptions:
        • a single quote in a single-quoted string must be written \'
        • a backslash in a single-quoted string must be written \\
    • Double-quoted
      • the string is interpolated: variables and control sequences are replaced by their values.
        • For example
          • $foo = "myfile";
          • $datafile = "$foo.txt";
          • will result in the variable $datafile holding the string "myfile.txt"
        • Another example
          • print 'Howdy\n'; will print:
            • Howdy\n
          • print "Howdy\n"; will print
            • Howdy
          • (\n is a control sequence, standing for “new line”).
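The difference between the two kinds of quotes can be checked directly. This is a small sketch; the variable names other than $foo and $datafile are made up for illustration.

```perl
use strict;
use warnings;

my $foo = "myfile";

# Double quotes interpolate variables and control sequences.
my $datafile = "$foo.txt";        # holds myfile.txt
my $double   = "Howdy\n";         # Howdy plus one newline character

# Single quotes keep everything literal except \' and \\.
my $single   = 'Howdy\n';         # Howdy plus a backslash and an n
my $escaped  = 'It\'s c:\\Perl';  # holds It's c:\Perl

print "$datafile\n";
print length($double), " characters vs ", length($single), " characters\n";
print "$escaped\n";
```

The length comparison makes the point: the double-quoted string is one character shorter, because \n became a single newline.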
  • 20. Scalar operators
    • Math
      • *, /, % (for modulo), ** (for exponentiation), etc.
    • Strings
      • x to repeat the thing on the left
        • "b" x 10 gives "bbbbbbbbbb"
      • . concatenates strings
        • ("na" x 16) . " Batman!" gives ...
    • Perl knows to convert when mixing these two types:
      • "3" * 4 gives 12
      • "3" . 4 gives "34"
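The operators on this slide can be tried out in a few lines. A quick sketch, with variable names chosen only for illustration:

```perl
use strict;
use warnings;

my $remainder = 17 % 5;                     # modulo
my $power     = 2 ** 10;                    # exponentiation
my $repeated  = "b" x 10;                   # repetition
my $chant     = ("na" x 16) . " Batman!";   # repetition, then concatenation

# The operator, not the data, decides whether a value is treated
# as a number or as a string.
my $product = "3" * 4;    # numeric multiplication
my $joined  = "3" . 4;    # string concatenation

print "$remainder $power $repeated\n";
print "$chant\n";
print "$product $joined\n";
```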
  • 21. Comparing Scalars
    • Comparison        Numeric   String
    • Equal             ==        eq
    • Not equal         !=        ne
    • Less than         <         lt
    • Greater than      >         gt
    • Less / equal      <=        le
    • Greater / equal   >=        ge
    • 8 < 25 is TRUE!
    • "8" lt "25" is FALSE! (Compared as strings, "8" comes after "2".)
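The numeric/string distinction matters most when sorting. The sketch below adds two operators the slide does not list, <=> (numeric three-way comparison) and cmp (string three-way comparison), which are what sort blocks normally use.

```perl
use strict;
use warnings;

my $num_less = (8 < 25)      ? "true" : "false";   # numeric comparison
my $str_less = ("8" lt "25") ? "true" : "false";   # string comparison

# The same three values sorted as numbers and as strings.
my @numeric = sort { $a <=> $b } (8, 25, 100);
my @stringy = sort { $a cmp $b } (8, 25, 100);

print "8 < 25 is $num_less; \"8\" lt \"25\" is $str_less\n";
print "numeric order: @numeric\n";
print "string order:  @stringy\n";
```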
  • 22. Variables
    • A sign (often called a “sigil”), followed by a letter or underscore, followed by any mix of letters, digits, and underscores.
    • Sign determines the type:
      • $foo is a scalar
      • @foo is an array
      • %foo is a hash
    • Variables default to global (they apply in all parts of your program). This can be problematic.
      • my $var creates a new variable that exists only within the current “block” of code. This is the usual construction.
      • local $var instead gives an existing global variable a temporary value until the current block ends; code called from inside the block sees that temporary value too.
      • the very common use strict; at the beginning of code requires variables to be declared before use (creates more compile-time errors, but prevents more whoppers that could blow everything up.)
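The difference between my and local is easiest to see when one subroutine calls another. A minimal sketch; the subroutine names are hypothetical, and our (which declares a global under use strict) is an addition the slide does not mention.

```perl
use strict;
use warnings;

our $level = "global";              # a package (global) variable

sub show_level { return $level }    # always reads the global

sub with_local {
    local $level = "temporary";     # global gets a temporary value
    return show_level();            # the callee sees "temporary"
}

sub with_my {
    my $level = "private";          # a brand-new variable, this block only
    return show_level();            # the callee still sees "global"
}

my $seen_with_local = with_local();
my $seen_with_my    = with_my();

print "local: $seen_with_local, my: $seen_with_my, afterwards: $level\n";
```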
  • 23. Lists and Arrays
    • A list is an ordered set of (usually) scalars.
    • An array is a variable holding a list.
    • my @foo = (1,2,3);
    • my @bar = ("elephant", 3.14);
    • Can be constructed as lists of scalar variables:
      • my @data = ($name, $address, $SSN);
  • 24. Using Arrays
    • Elements are indexed, from 0.
      • my @animals = ("frog", "bear", "elephant");
      • print $animals[2]; # prints elephant
      • Note: element is a scalar, so $ rather than @
    • Subsections are “slices”.
      • my @mammals = @animals[1,2];
    • Lots of functions for
      • using as a stack (moving things on and off the right or left side of the array).
      • sorting
      • joining two arrays
      • splitting a scalar string into an array
        • my $sentence = "This is my sentence.";
        • my @words = split(" ", $sentence);
        • # now @words contains ("This", "is", "my", "sentence.")
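The array functions mentioned above look roughly like this in practice. The function names (push, pop, shift, unshift, sort, join, split) are standard Perl built-ins; the data is made up.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");

# Stack-style operations on the right and left ends.
push @animals, "zebra";            # add on the right
my $last = pop @animals;           # take it back off the right
unshift @animals, "ant";           # add on the left
my $first = shift @animals;        # take it back off the left

my @mammals = @animals[1, 2];      # a slice
my @sorted  = sort @animals;       # alphabetical order
my @all     = (@animals, "newt", "owl");   # joining two lists

# Splitting a string into an array, and joining it back up.
my @words    = split(" ", "This is my sentence.");
my $rejoined = join("-", @words);

print "@sorted\n$rejoined\n";
```

Note that split on whitespace leaves punctuation attached to the word it touches, so the last element is "sentence." with the period.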
  • 25. Programming Controls
    • Control structures
      • if / elsif / else
      • while
      • do {} while
      • do {} until
      • for ()
      • foreach() # loops over a list
    • Errors / warnings
      • die "message" kills program and prints “message”.
      • warn "message" prints message and keeps going.
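A short sketch of these control structures in use. The data is made up, and wrapping die in eval { } to catch the error instead of ending the program is an addition not covered on the slide.

```perl
use strict;
use warnings;

# foreach with if / elsif / else
my @report;
foreach my $n (3, 0, -2) {
    if    ($n > 0)  { push @report, "$n is positive"; }
    elsif ($n == 0) { push @report, "$n is zero"; }
    else            { push @report, "$n is negative"; }
}

# while
my $count = 0;
while ($count < 3) { $count++; }

# C-style for
my $total = 0;
for (my $i = 1; $i <= 4; $i++) { $total += $i; }

# die normally ends the program; inside eval it only ends the block,
# and the message is left in $@.
my $ok    = eval { die "something broke\n"; 1 };
my $error = $@;

print join("; ", @report), "\n";
print "count=$count total=$total error=$error";
```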
  • 26. Hashes
    • “Associative arrays”
    • A set of
      • values (any scalar), indexed by
      • keys (strings)
    • Example
      • my %info;
      • $info{"name"} = "Burt Monroe";
      • $info{"age"} = 39;
    • With hashes and arrays you can create almost any arbitrary data structure (even arrays of arrays, arrays of hashes, hashes of arrays, etc.)
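Nested structures are built with references. The sketch below shows a hash of arrays and an array of hashes; the speaker names, communes, and vote counts are invented for illustration.

```perl
use strict;
use warnings;

my %info;
$info{"name"} = "Burt Monroe";
$info{"age"}  = 39;
my @fields = sort keys %info;

# Hash of arrays: each (hypothetical) speaker maps to a list of topics.
my %speeches = (
    Smith => ["health", "taxes"],
    Jones => ["health"],
);
push @{ $speeches{Jones} }, "defense";
my $jones_topics = scalar @{ $speeches{Jones} };

# Array of hashes: one record per (hypothetical) commune.
my @rows = (
    { commune => "A", votes => 10 },
    { commune => "B", votes => 7  },
);
my $total_votes = 0;
$total_votes += $_->{votes} for @rows;

print "@fields; Jones has $jones_topics topics; $total_votes votes\n";
```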
  • 27. File Handling
    • open() function opens a file for processing.
    • Prefix the filename to define how
      • "<" for input from existing file (read)
      • ">" to create for output (write; an existing file is overwritten)
      • ">>" to append to a file (that may not yet exist)
    • open(IN, "<myfile.txt") or die "Can't open myfile.txt";
    • Can then wrap the filehandle in <> to read from the file. For the above, that would be <IN>.
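Here is a sketch of the three modes in one round trip: write, append, then read back. It uses the core File::Temp module to get a scratch filename, which is an assumption made so the example does not touch any real file; the handle names OUT and IN follow the slide's style.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Get a temporary filename that is removed when the program exits.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);

# ">" creates (or overwrites) the file for output.
open(OUT, ">$filename") or die "Can't open $filename for writing: $!";
print OUT "first line\n";
close(OUT);

# ">>" appends to the end of the file.
open(OUT, ">>$filename") or die "Can't open $filename for appending: $!";
print OUT "second line\n";
close(OUT);

# "<" opens for input; <IN> returns one line per read.
open(IN, "<$filename") or die "Can't open $filename for reading: $!";
my @lines;
while (my $line = <IN>) {
    chomp $line;            # remove the trailing newline
    push @lines, $line;
}
close(IN);

print scalar(@lines), " lines read\n";
```

Newer Perl code usually prefers the three-argument form with a lexical handle, as in open(my $in, "<", $filename), but the two-argument form shown on the slide works as well.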
  • 28. Matching string patterns using regular expressions
    • This is where much of the power of Perl lies.
    • m/pattern/ will check the default variable ( $_ ) for pattern.
    • $var =~ m/pattern/; will check $var for pattern.
    • If the pattern is in $var, then
      • $var =~ m/pattern/ is TRUE.
    • If you “group” part of the pattern and it is present,
      • $var =~ m/(pattern)/ is true, AND a variable named $1 now contains the text that the group matched.
      • Group more pieces of the pattern and the matches are stored in $2, $3, etc.
    • This only grabs the *first* match. To grab all, say
      • my @matches = ($var =~ m/(pattern)/g);
      • This will store every match in the array @matches.
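The matching forms above, side by side. This is a sketch on an invented string; the variable names other than $var and @matches are assumptions.

```perl
use strict;
use warnings;

my $var = "Call 555-1234 or 555-9876 today";

# A plain match is simply true or false.
my $found = ($var =~ m/\d\d\d-\d\d\d\d/) ? 1 : 0;

# Groups are captured into $1, $2, ... (first match only).
my ($prefix, $number);
if ($var =~ m/(\d\d\d)-(\d\d\d\d)/) {
    ($prefix, $number) = ($1, $2);
}

# With /g in list context, every match is returned.
my @matches = ($var =~ m/(\d\d\d-\d\d\d\d)/g);

# With no variable named, m// checks $_, which foreach sets for us.
my $hits = 0;
foreach ("health care", "taxes", "public health") {
    $hits++ if m/health/;
}

print "found=$found first=$prefix-$number all=@matches hits=$hits\n";
```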
  • 29. What’s a “regular expression”?
    • Combination of
      • any literal character, number, etc.
      • . any single character
      • * zero or more of the previous
      • + one or more of the previous
      • ? zero or one of the previous
      • [aeiou] character class – this is the vowels
      • ^ beginning of the line
      • $ end of the line
      • \b word boundary
      • \d \D digit / non-digit
      • \s \S space / non-space
      • \w \W word character / non-word character
      • | or – match this or that
      • () grouping
    • See handout for more.
  • 30. Examples
    • Romeo|Juliet “Romeo” or “Juliet”
    • \d\d\d-\d\d\d\d a phone number
    • (\d\d\d-)?\d\d\d-\d\d\d\d phone #, maybe w/ area
    • \b[aeiou]\w+ a word starting w/ a vowel
    • [A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4} email add.
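These patterns can be tried against a few test strings. In this sketch the ^ and $ anchors, the \b before the vowel, the /i (ignore case) flag, and the backslash before @ (which stops Perl from reading it as an array) are additions made for the example; the test strings are invented.

```perl
use strict;
use warnings;

my $phone = qr/^(\d\d\d-)?\d\d\d-\d\d\d\d$/;
my $email = qr/^[A-Z0-9._%-]+\@[A-Z0-9.-]+\.[A-Z]{2,4}$/i;

# grep keeps only the strings that match.
my @phone_ok = grep { $_ =~ $phone } ("555-1234", "814-555-1234", "5551234");
my @email_ok = grep { $_ =~ $email } ('someone@example.edu', 'not an email');

# Alternation and the vowel pattern, collecting every match.
my @lovers      = ("Romeo, Romeo, wherefore art thou" =~ m/(Romeo|Juliet)/g);
my @vowel_words = ("an apple under trees" =~ m/\b([aeiou]\w+)/g);

print "phones: @phone_ok\n";
print "emails: @email_ok\n";
print "vowel words: @vowel_words\n";
```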
  • 31. Modules
    • Hundreds of modules / packages available through CPAN.
    • ActivePerl gives a GUI for installing them in its “Perl Package Manager”.
  • 32. A basic Perl example
    • Counting words.
      • counter1.pl
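counter1.pl is not reproduced in the transcript. As a rough idea of what a word counter looks like, here is a sketch that counts words in a string held in memory; the sample text and all variable names are assumptions, and the workshop script may well work differently (for example, by reading a file).

```perl
use strict;
use warnings;

my $text = "The cat and the hat, and the bat.";

my %count;
foreach my $word (split(" ", lc $text)) {
    $word =~ s/[^a-z']//g;      # strip punctuation
    next if $word eq "";        # skip anything that was only punctuation
    $count{$word}++;            # the hash keeps one tally per word
}

# Most frequent first; ties broken alphabetically.
my @ranked = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

foreach my $word (@ranked) {
    print "$word\t$count{$word}\n";
}
```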
  • 33. Grabbing from the web
    • The basic idea is simply to have Perl act as an “agent”, in the way a browser like Explorer or Firefox does -- requesting and interpreting webpages.
    • There are a few basic modules that can do this.
  • 34. LWP::Simple
    • lwpsimpleget.pl
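lwpsimpleget.pl is not reproduced in the transcript. The sketch below shows the general shape of a fetch with LWP::Simple, whose get function returns the page content or undef on failure. The subroutine name fetch_page and the placeholder URL are assumptions; the module is loaded inside the subroutine so the file still compiles where LWP is not installed.

```perl
use strict;
use warnings;

# Fetch one page and return its HTML, or die with a message.
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;                    # loaded only when called
    my $html = LWP::Simple::get($url);      # undef if the request failed
    die "Couldn't get $url\n" unless defined $html;
    return $html;
}

# Example call (placeholder URL; not run here):
#   my $html = fetch_page("http://www.example.com/");
#   print length($html), " characters downloaded\n";
```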
  • 35. LWP::UserAgent
    • More elaborate than LWP::Simple.
    • I’m going to skip that one today, but it’s covered in detail in the main books
      • Perl & LWP
      • Spidering Hacks
    • Pretty much all of the functionality has been wrapped more intuitively into ...
  • 36. WWW::Mechanize
    • mechanizeget.pl
  • 37. Scraping
    • At its base, this is just extracting information from the page(s) you download.
    • Simple example:
      • freshair.pl
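freshair.pl is not reproduced in the transcript. To show the idea, the sketch below scrapes titles and links from a small, invented HTML string rather than a downloaded page; the markup, class name, and URLs are all hypothetical.

```perl
use strict;
use warnings;

# Stand-in for a downloaded page.
my $html = <<'HTML';
<html><body>
<h2 class="title">Story One</h2>
<h2 class="title">Story Two</h2>
<a href="/archive/1.html">first</a>
<a href="http://example.com/2.html">second</a>
</body></html>
HTML

# .*? is a non-greedy match: it stops at the first closing tag.
my @titles = ($html =~ m{<h2 class="title">(.*?)</h2>}g);

# [^"]+ grabs everything up to the closing quotation mark.
my @links = ($html =~ m{<a\s+href="([^"]+)"}g);

print "titles: ", join(" | ", @titles), "\n";
print "links:  ", join(" | ", @links), "\n";
```

Patterns like these are tied to the exact markup of one page, so they tend to break when the site changes its layout.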
  • 38. Your agent can interact ...
    • For example, what if the webpage involves a form ...
    • Example
      • abstracts.pl
    • You can authenticate with username and password, run through proxy servers, and so on.
  • 39. Spiders
    • Type 1 Requester
      • Requests a few items with known urls from a website.
    • Type 2 Requester
      • Requests a few items, then requests (some set of) pages to which those items link.
    • Type 3 Requester
      • Starts at a given url, and then requests everything linked, everything linked by that, etc. on the same host server. The idea here is usually to download an entire website.
    • Type 4 Requester
      • Starts at a given url, requests everything linked anywhere, everything linked by that, etc. until it, perhaps, visits the entire web.
    • YOU – I am talking to YOU – in all likelihood have no business writing Type 3 or Type 4 spiders. These can easily go seriously awry causing mayhem of many sorts. Write only spiders with known finite scope.
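One way to keep a spider's scope finite is to combine a hard page limit, a record of pages already seen, and a filter on which links may be followed. The sketch below shows that bookkeeping on an invented three-page “site” stored in a hash, so nothing is actually requested; the crawl subroutine, the URLs, and the limit are all assumptions. A real spider would fetch each page over HTTP and pause between requests.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the web: URL => page text.
my %fake_site = (
    "http://example.com/index.html" =>
        '<a href="http://example.com/a.html">A</a> <a href="http://example.com/b.html">B</a>',
    "http://example.com/a.html" =>
        '<a href="http://example.com/index.html">home</a> <a href="http://elsewhere.example.org/x.html">X</a>',
    "http://example.com/b.html" => 'no links here',
);

sub crawl {
    my ($start, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;

    while (@queue and @visited < $max_pages) {      # hard limit on pages
        my $url  = shift @queue;
        my $page = $fake_site{$url};
        next unless defined $page;
        push @visited, $url;

        foreach my $link ($page =~ m{href="([^"]+)"}g) {
            next unless $link =~ m{^http://example\.com/};   # stay on one host
            next if $seen{$link}++;                          # never queue a page twice
            push @queue, $link;
        }
    }
    return @visited;
}

my @visited = crawl("http://example.com/index.html", 10);
print "$_\n" for @visited;
```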
  • 40. Back to the Luxembourg Miner
    • Commune-level election results from Luxembourg.
      • luxembourg.pl
  • 41. More on Scraping
    • All of the examples scraped / parsed using regular expressions.
    • More structured data like HTML is often better handled (or can only be handled) with more specialized tools:
      • HTML::TokeParser
      • HTML::TreeBuilder
    • There are modules for scraping from XML, spreadsheets, databases, Word docs, PDFs.