Data-Mining the Web Using Perl
Transcript

  • 1. Data-Mining the Web Using Perl Burt L. Monroe Director, Quantitative Social Science Initiative Department of Political Science The Pennsylvania State University
  • 2. Data-Mining the Web <ul><li>Examples </li></ul><ul><ul><li>Election Returns in Luxembourg </li></ul></ul><ul><ul><ul><li>Luxembourg Official Election Results, 2004 </li></ul></ul></ul><ul><ul><ul><li>http://qssi.psu.edu/files/luxembourg.pl </li></ul></ul></ul><ul><ul><li>Parliamentary Speech </li></ul></ul><ul><ul><ul><li>The Congressional Record </li></ul></ul></ul>
  • 3. How’d You Do That? <ul><li>There are several programming languages with “straightforward” facilities for doing this. Most notably, </li></ul><ul><ul><li>Perl </li></ul></ul><ul><ul><li>Python </li></ul></ul><ul><ul><li>Java </li></ul></ul><ul><li>I’m going to talk about Perl, because </li></ul><ul><ul><li>it’s the most established </li></ul></ul><ul><ul><li>it’s the one I know </li></ul></ul><ul><li>It appears that Python may be preferable, but that’s for someone else to say. </li></ul>
  • 4. What’s Perl? <ul><li>Open source (free / flexible / extensible / a little wild and woolly – like Linux, R) programming language. </li></ul><ul><li>It is very very good at processing text. </li></ul><ul><ul><li>note, webpages are just texts. </li></ul></ul><ul><ul><li>note, datasets (like a flat spreadsheet or Stata file) are just texts. </li></ul></ul><ul><ul><li>Social scientists might have some use for turning one into the other, no? </li></ul></ul><ul><li>It has very useful facilities for building </li></ul><ul><ul><li>Spiders </li></ul></ul><ul><ul><li>Scrapers </li></ul></ul><ul><ul><li>(and “agents”, “robots”, “crawlers”, etc.) </li></ul></ul>
  • 5. What’s a Spider? <ul><li>A spider is a program designed to automatically gather webpages. </li></ul><ul><li>If, for example, you want to automatically download all of the speeches delivered in Congress today – without manually clicking on every one, cutting and pasting, etc. – you might want to build a spider. </li></ul>
  • 6. What’s a scraper? <ul><li>A scraper (or “screen-scraper”) extracts the information you want – whatever you consider to be data – from a given webpage. </li></ul><ul><li>If you want to know who said “health” and how many times, you might want to build a scraper. </li></ul>
  • 7. BEWARE! <ul><li>Spiders (and other similar types of programs – “robots”, “crawlers”) can be put to nefarious use: </li></ul><ul><ul><li>appropriating copyrighted materials </li></ul></ul><ul><ul><li>extracting email addresses for spammers </li></ul></ul><ul><ul><li>overwhelming servers to create “denial of service” </li></ul></ul><ul><ul><li>generally violating a site’s “terms of service” or “acceptable use policy” </li></ul></ul><ul><li>If you are not careful to use legal and ethical good practices, you can </li></ul><ul><ul><li>be denied access to a website altogether </li></ul></ul><ul><ul><li>get yourself or the university sued or even subjected to criminal penalties </li></ul></ul>
  • 8. Perl <ul><li>Open-source </li></ul><ul><li>Cross-platform </li></ul><ul><ul><li>(Windows – I recommend “ActivePerl” from http://www.activestate.com) </li></ul></ul><ul><li>There are many websites with resources: </li></ul><ul><ul><li>http://www.cpan.org (Comprehensive Perl Archive Network) </li></ul></ul><ul><ul><li>http://www.perlmonks.org (PerlMonks) </li></ul></ul><ul><ul><li>http://www.perl.org </li></ul></ul><ul><ul><li>http://perl.oreilly.com (O’Reilly Publishing) </li></ul></ul><ul><li>Lots of mailing lists, etc. </li></ul>
  • 9. Books <ul><li>Basics of Perl </li></ul><ul><ul><li>The best books are put out by O’Reilly Publishing and are generally known by the animal on the cover. </li></ul></ul><ul><ul><li>Learning Perl (the Llama) </li></ul></ul><ul><ul><ul><li>or, Learning Perl on Win32 Systems (the Gecko) </li></ul></ul></ul><ul><ul><li>Programming Perl (the Camel) </li></ul></ul><ul><li>Web-mining </li></ul><ul><ul><li>Perl & LWP (the Blesbok, apparently) </li></ul></ul><ul><ul><li>Spidering Hacks </li></ul></ul><ul><li>These books, and some others, are or will be available in the “QuaSSI Library” (in Pond 216). </li></ul>
  • 10. Running Perl <ul><li>For machines with approved ActivePerl installations in Pond ... </li></ul><ul><ul><li>Perl is located in c:/Perl/ </li></ul></ul><ul><li>For today, </li></ul><ul><ul><li>we will operate entirely in the directory c:/Perl/eg/ </li></ul></ul><ul><ul><li>To get there, </li></ul></ul><ul><ul><ul><li>open Programs -> Accessories -> Command Prompt </li></ul></ul></ul><ul><ul><ul><li>At the prompt, type c: </li></ul></ul></ul><ul><ul><ul><li>Type cd Perl/eg </li></ul></ul></ul><ul><li>(In your particular installation, or in a Mac, or something like Unix on high performance computing, these details will be different.) </li></ul>
  • 11. The First Perl Program <ul><li>Go to the QuaSSI Website for the example scripts for today’s workshop: </li></ul><ul><ul><li>http://qssi.psu.edu/files/howdy.pl </li></ul></ul><ul><li>Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg </li></ul><ul><li>Open up the text-editor WinEdt (you could use almost anything) and then open howdy.pl </li></ul><ul><li>That’s a complete Perl program. </li></ul><ul><li>Note: that’s all a program is – a text file. </li></ul>
  • 12. Running a Perl Program <ul><li>Go back to your command prompt. </li></ul><ul><li>Type perl -w howdy.pl </li></ul><ul><li>(The -w tells perl to give you warnings about what might be wrong if the program is broken. It goes before the script name; anything after the script name is passed to the script itself.) </li></ul>
  • 13. Modifying a program <ul><li>Go back to WinEdt </li></ul><ul><li>Edit the text between the quotation marks to say something new </li></ul><ul><li>Click File -> Save </li></ul><ul><li>Go back to the command prompt </li></ul><ul><li>Hit the up arrow (to get the last command), then press Enter to run it again </li></ul><ul><li>Look at that – you’re a programmer! </li></ul>
  • 14. Break the program <ul><li>Go back to WinEdt </li></ul><ul><li>Delete the semicolon at the end of the line </li></ul><ul><li>Save the file </li></ul><ul><li>Go back to the command prompt and run the program, with -w, again </li></ul><ul><li>What happened? </li></ul>
  • 15. Perl at 30,000 feet <ul><li>Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com </li></ul><ul><li>(I’m skipping even more than he did.) </li></ul>
  • 16. Some generalities about Perl <ul><li>Statements in Perl are, or usually can be, constructed in a fairly natural English-like way. </li></ul><ul><li>There are many ways to do any one thing. </li></ul><ul><li>The syntax can be off-putting and hard to read, especially at first. It is easy to “obfuscate” Perl code and this is sometimes done intentionally. </li></ul><ul><li>Main syntax rule: end every statement with ; </li></ul>
  • 17. Data Types <ul><li>Scalars </li></ul><ul><li>Arrays and Lists </li></ul><ul><li>Hashes </li></ul><ul><li>References </li></ul><ul><li>Filehandles </li></ul><ul><li>Objects </li></ul>
  • 18. Scalars <ul><li>Numbers </li></ul><ul><ul><li>Generally decimal floating point </li></ul></ul><ul><ul><li>(Can be made integer, octal, hexadecimal) </li></ul></ul><ul><li>Strings </li></ul><ul><ul><li>Can contain any character </li></ul></ul><ul><ul><li>Can be null: "" </li></ul></ul><ul><ul><li>Can be arbitrarily large </li></ul></ul>
  • 19. Strings <ul><li>Single-quoted </li></ul><ul><ul><li>characters are as shown with only two exceptions. </li></ul></ul><ul><ul><ul><li>single-quote in a single-quoted string requires \' </li></ul></ul></ul><ul><ul><ul><li>backslash in a single-quoted string requires \\ </li></ul></ul></ul><ul><li>Double-quoted </li></ul><ul><ul><li>it will interpolate – calculate variables or control sequences. </li></ul></ul><ul><ul><ul><li>For example </li></ul></ul></ul><ul><ul><ul><ul><li>$foo = "myfile"; </li></ul></ul></ul></ul><ul><ul><ul><ul><li>$datafile = "$foo.txt"; </li></ul></ul></ul></ul><ul><ul><ul><ul><li>will result in the variable $datafile holding the string "myfile.txt" </li></ul></ul></ul></ul><ul><ul><ul><li>Another example </li></ul></ul></ul><ul><ul><ul><ul><li>print 'Howdy\n'; will print: </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Howdy\n </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><li>print "Howdy\n"; will print </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Howdy </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><li>(\n is a control sequence, standing for “new line”). </li></ul></ul></ul></ul>
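The quoting rules on this slide can be tried out directly. The following is a minimal sketch; the variable names $literal and $escaped are illustrative additions, not part of the original slides.

```perl
use strict;
use warnings;

my $foo = "myfile";

# Double quotes interpolate: $foo is replaced by its value.
my $datafile = "$foo.txt";

# Single quotes do not: this string starts with a literal dollar sign.
my $literal = '$foo.txt';

# The two single-quote exceptions: \' for a quote, \\ for a backslash.
my $escaped = 'It\'s in c:\\Perl';

print "$datafile\n";    # myfile.txt
print "$literal\n";     # $foo.txt
print "$escaped\n";     # It's in c:\Perl
print 'Howdy\n';        # Howdy\n (a backslash and an n, no line break)
print "\n";
```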
  • 20. Scalar operators <ul><li>Math </li></ul><ul><ul><li>*, /, % (for modulo), ** (for exponentiation), etc. </li></ul></ul><ul><li>Strings </li></ul><ul><ul><li>x to repeat the thing on the left </li></ul></ul><ul><ul><ul><li>"b" x 10 gives "bbbbbbbbbb" </li></ul></ul></ul><ul><ul><li>. concatenates strings </li></ul></ul><ul><ul><ul><li>("na" x 16) . " Batman!" gives ... </li></ul></ul></ul><ul><li>Perl knows to convert when mixing these two types: </li></ul><ul><ul><li>"3" * 4 gives 12 </li></ul></ul><ul><ul><li>"3" . 4 gives "34" </li></ul></ul>
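A short sketch of these operators in use, so the results can be checked at the command prompt. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $row       = "b" x 10;                   # repeat a string
my $theme     = ("na" x 16) . " Batman!";   # repeat, then concatenate
my $product   = "3" * 4;                    # string used as a number
my $joined    = "3" . 4;                    # number used as a string
my $remainder = 17 % 5;                     # modulo
my $power     = 2 ** 10;                    # exponentiation

print "$row\n";        # bbbbbbbbbb
print "$theme\n";
print "$product\n";    # 12
print "$joined\n";     # 34
print "$remainder\n";  # 2
print "$power\n";      # 1024
```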
  • 21. Comparing Scalars <ul><li>Comparison: numeric operator, string operator </li></ul><ul><li>Equal: ==, eq </li></ul><ul><li>Not equal: !=, ne </li></ul><ul><li>Less than: <, lt </li></ul><ul><li>Greater than: >, gt </li></ul><ul><li>Less / equal: <=, le </li></ul><ul><li>Greater / equal: >=, ge </li></ul><ul><li>8 < 25 TRUE! </li></ul><ul><li>"8" lt "25" FALSE! (compared as strings, character by character, and "8" sorts after "2") </li></ul>
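The difference between the two operator families matters most when sorting. A minimal sketch, using an illustrative list of numbers; sort without a comparison block compares as strings, while <=> compares as numbers.

```perl
use strict;
use warnings;

my $numeric_result = (8 < 25)      ? "TRUE" : "FALSE";
my $string_result  = ("8" lt "25") ? "TRUE" : "FALSE";

my @numbers    = (25, 8, 100);
my @as_numbers = sort { $a <=> $b } @numbers;   # numeric order
my @as_strings = sort @numbers;                 # default: string order

print "8 < 25 is $numeric_result\n";            # TRUE
print "\"8\" lt \"25\" is $string_result\n";    # FALSE
print "numeric sort: @as_numbers\n";            # 8 25 100
print "string sort:  @as_strings\n";            # 100 25 8
```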
  • 22. Variables <ul><li>A sign, followed by a letter, followed by pretty much whatever. </li></ul><ul><li>Sign determines the type: </li></ul><ul><ul><li>$foo is a scalar </li></ul></ul><ul><ul><li>@foo is an array (a variable holding a list) </li></ul></ul><ul><ul><li>%foo is a hash </li></ul></ul><ul><li>Variables default to global (they apply in all parts of your program). This can be problematic. </li></ul><ul><ul><li>local $var gives a global variable a temporary value for the current “block” of code (and anything that block calls). </li></ul></ul><ul><ul><li>my $var creates a variable that exists only within the current block, and is the more usual construction. </li></ul></ul><ul><ul><li>the very common use strict; at the beginning of code forces good practice in declaring variables (creates more syntax errors, but prevents more whoppers that could blow everything up.) </li></ul></ul>
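The difference between local and my is easiest to see when one subroutine calls another. This is a small sketch with made-up names ($level, show_level, and so on); it is meant only to illustrate scoping, not to suggest a style for real scripts.

```perl
use strict;
use warnings;

our $level = "global";          # a package (global) variable

sub show_level { return $level; }

# local: the global gets a temporary value, visible to called subs too.
sub with_local {
    local $level = "temporary";
    return show_level();
}

# my: a new private variable; show_level() still sees the global.
sub with_my {
    my $level = "private";
    return show_level();
}

print with_local(), "\n";   # temporary
print with_my(), "\n";      # global
print show_level(), "\n";   # global (the local value has been undone)
```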
  • 23. Lists and Arrays <ul><li>A list is an ordered set of (usually) scalars. </li></ul><ul><li>An array is a variable holding a list. </li></ul><ul><li>my @foo = (1,2,3); </li></ul><ul><li>my @bar = ("elephant", 3.14); </li></ul><ul><li>Can be constructed as lists of scalar variables: </li></ul><ul><ul><li>my @data = ($name, $address, $SSN); </li></ul></ul>
  • 24. Using Arrays <ul><li>Elements are indexed, from 0. </li></ul><ul><ul><li>my @animals = ("frog", "bear", "elephant"); </li></ul></ul><ul><ul><li>print $animals[2]; # prints elephant </li></ul></ul><ul><ul><li>Note: element is a scalar, so $ rather than @ </li></ul></ul><ul><li>Subsections are “slices”. </li></ul><ul><ul><li>my @mammals = @animals[1,2]; </li></ul></ul><ul><li>Lots of functions for </li></ul><ul><ul><li>using as a stack (moving things on and off the right or left side of the array). </li></ul></ul><ul><ul><li>sorting </li></ul></ul><ul><ul><li>joining two arrays </li></ul></ul><ul><ul><li>splitting a scalar string into an array </li></ul></ul><ul><ul><ul><li>my $sentence = "This is my sentence."; </li></ul></ul></ul><ul><ul><ul><li>my @words = split(" ", $sentence); </li></ul></ul></ul><ul><ul><ul><li># now @words contains ("This", "is", "my", "sentence."); </li></ul></ul></ul>
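A sketch of the array functions this slide alludes to, using Perl's built-in push, shift, sort, split, and join. The extra animal and the hyphen separator are illustrative choices.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");
my $last    = $animals[2];        # one element is a scalar: elephant
my @mammals = @animals[1, 2];     # a slice: bear, elephant

push @animals, "newt";            # add to the right-hand end
my $first = shift @animals;       # remove from the left-hand end: frog

my @sorted = sort @animals;       # bear, elephant, newt

my $sentence = "This is my sentence.";
my @words    = split(" ", $sentence);
my $rejoined = join("-", @words); # This-is-my-sentence.

print "last: $last, first: $first\n";
print "mammals: @mammals\n";
print "sorted: @sorted\n";
print "$rejoined\n";
```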
  • 25. Programming Controls <ul><li>Control structures </li></ul><ul><ul><li>if / elsif / else (there is no “then” keyword; the block in braces plays that role) </li></ul></ul><ul><ul><li>while </li></ul></ul><ul><ul><li>do {} while </li></ul></ul><ul><ul><li>do {} until </li></ul></ul><ul><ul><li>for () </li></ul></ul><ul><ul><li>foreach() # loops over a list </li></ul></ul><ul><li>Errors / warnings </li></ul><ul><ul><li>die "message" kills program and prints “message”. </li></ul></ul><ul><ul><li>warn "message" prints message and keeps going. </li></ul></ul>
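A compact sketch of several of these control structures together. The describe subroutine and the numbers are made up for illustration; the eval block at the end is an addition not mentioned on the slide, shown only so that die can be demonstrated without ending the script.

```perl
use strict;
use warnings;

# if / elsif / else
sub describe {
    my ($n) = @_;
    if    ($n < 0)  { return "negative"; }
    elsif ($n == 0) { return "zero"; }
    else            { return "positive"; }
}

# foreach loops over a list
my @labels;
foreach my $n (-2, 0, 7) {
    push @labels, describe($n);
}

# while loops until its condition is false
my ($i, $total) = (0, 0);
while ($i < 5) {
    $total += $i;     # 0 + 1 + 2 + 3 + 4
    $i++;
}

# die normally ends the program; inside eval it can be caught.
my $ok    = eval { die "fatal problem\n"; 1 };
my $error = $@;

print "@labels\n";                 # negative zero positive
print "total: $total\n";           # total: 10
print "caught: $error" if !$ok;    # caught: fatal problem
```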
  • 26. Hashes <ul><li>“Associative arrays” </li></ul><ul><li>A set of </li></ul><ul><ul><li>values (any scalar), indexed by </li></ul></ul><ul><ul><li>keys (strings) </li></ul></ul><ul><li>Example </li></ul><ul><ul><li>my %info; </li></ul></ul><ul><ul><li>$info{"name"} = "Burt Monroe"; </li></ul></ul><ul><ul><li>$info{"age"} = 39; </li></ul></ul><ul><li>With hashes and arrays you can create almost any arbitrary data structure (even arrays of arrays, arrays of hashes, hashes of arrays, etc.) </li></ul>
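A sketch of the hash from the slide, followed by one nested structure (a hash of arrays). The %topics data, including the speaker names, is hypothetical and exists only to show the syntax.

```perl
use strict;
use warnings;

my %info;
$info{"name"} = "Burt Monroe";
$info{"age"}  = 39;

# Loop over the keys in alphabetical order.
foreach my $key (sort keys %info) {
    print "$key: $info{$key}\n";
}

# A hash of arrays: each value is a reference to an array.
# (Hypothetical data: topics mentioned by two made-up speakers.)
my %topics = (
    Smith => ["health", "taxes"],
    Jones => ["defense"],
);
push @{ $topics{Jones} }, "health";

my $jones_count = scalar @{ $topics{Jones} };
print "Jones mentioned $jones_count topics: @{ $topics{Jones} }\n";
```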
  • 27. File Handling <ul><li>open() function opens a file for processing. </li></ul><ul><li>Prefix the filename to define how </li></ul><ul><ul><li>"<" for input from existing file (read) </li></ul></ul><ul><ul><li>">" to create for output (write) </li></ul></ul><ul><ul><li>">>" to append to a file (that may not yet exist) </li></ul></ul><ul><li>open (IN, "<myfile.txt") or die "Can't open myfile.txt"; </li></ul><ul><li>Can then use <> to refer to the file. The above would be <IN>. </li></ul>
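The three modes can be seen in one short round trip. This sketch makes two choices that differ from the slide: it uses the three-argument form of open with a lexical filehandle (a common modern style, equivalent in effect to the two-argument form shown above), and it writes to a temporary file created with the core File::Temp module rather than to myfile.txt.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is deleted when the program exits.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# ">" creates (or overwrites) the file for writing.
open(my $out, ">", $filename) or die "Can't write $filename: $!";
print $out "first line\n";
close $out;

# ">>" appends to the end of the file.
open($out, ">>", $filename) or die "Can't append to $filename: $!";
print $out "second line\n";
close $out;

# "<" reads; <$in> in list context returns all the lines.
open(my $in, "<", $filename) or die "Can't read $filename: $!";
my @lines = <$in>;
close $in;
chomp @lines;             # strip the trailing newlines

print scalar(@lines), " lines read\n";
print "$_\n" foreach @lines;
```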
  • 28. Matching string patterns using regular expressions <ul><li>This is where much of the power of Perl lies. </li></ul><ul><li>m/pattern/ will check the default variable ( $_ ) for pattern. </li></ul><ul><li>$var =~ m/pattern/; will check $var for pattern. </li></ul><ul><li>If the pattern is in $var, then </li></ul><ul><ul><li>$var =~ m/pattern/ is TRUE. </li></ul></ul><ul><li>If you “group” part of the pattern and it is present, </li></ul><ul><ul><li>$var =~ m/(pattern)/ is true, AND, now a variable named $1 contains the first match it found. </li></ul></ul><ul><ul><li>Group more pieces of the pattern and the matches are stored in $2, $3, etc. </li></ul></ul><ul><li>This only grabs the *first* match. To grab all, say </li></ul><ul><ul><li>my @matches = ($var =~ m/(pattern)/g); </li></ul></ul><ul><ul><li>This will store every match in the array @matches. </li></ul></ul>
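A sketch of grouping and the /g modifier on a single line of text. The sentence in $line and its speaker names are invented for illustration; the patterns are one possible way to pull pieces out of it.

```perl
use strict;
use warnings;

# Hypothetical line of text, standing in for part of a downloaded page.
my $line = "Mr. SMITH said health 3 times; Ms. JONES said health 5 times.";

# Groups: the first match fills $1 and $2.
my ($speaker, $word);
if ($line =~ m/(\w+) said (\w+)/) {
    ($speaker, $word) = ($1, $2);
}

# /g in list context returns every match of the group.
my @speakers = ($line =~ m/([A-Z]{2,}) said/g);
my @mentions = ($line =~ m/(health)/g);

print "first speaker: $speaker, word: $word\n";
print "all speakers: @speakers\n";
print "health appears ", scalar(@mentions), " times\n";
```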
  • 29. What’s a “regular expression”? <ul><li>Combination of </li></ul><ul><ul><li>any literal character, number, etc. </li></ul></ul><ul><ul><li>. any single character </li></ul></ul><ul><ul><li>* zero or more of the previous </li></ul></ul><ul><ul><li>+ one or more of the previous </li></ul></ul><ul><ul><li>? zero or one of the previous </li></ul></ul><ul><ul><li>[aeiou] character class – this is the vowels </li></ul></ul><ul><ul><li>^ beginning of the line </li></ul></ul><ul><ul><li>$ end of the line </li></ul></ul><ul><ul><li>\b word boundary </li></ul></ul><ul><ul><li>\d \D digit / non-digit </li></ul></ul><ul><ul><li>\s \S space / non-space </li></ul></ul><ul><ul><li>\w \W word character / non-word character </li></ul></ul><ul><ul><li>| or – match this or that </li></ul></ul><ul><ul><li>() grouping </li></ul></ul><ul><li>See handout for more. </li></ul>
  • 30. Examples <ul><li>Romeo|Juliet “Romeo” or “Juliet” </li></ul><ul><li>\d\d\d-\d\d\d\d a phone number </li></ul><ul><li>(\d\d\d-)?\d\d\d-\d\d\d\d phone #, maybe w/ area code </li></ul><ul><li>[aeiou]\w+ a word starting w/ a vowel </li></ul><ul><li>[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4} an email address </li></ul>
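These patterns can be tried against a sample string. In Perl source the digit and word classes are written \d and \w. The sketch below makes a few assumptions beyond the slide: the sample text is invented, the @ is escaped as \@ to keep Perl from treating it as an array, the /i modifier is added so lowercase addresses match, and a leading \b is added to the vowel pattern so it only matches at the start of a word.

```perl
use strict;
use warnings;

# Invented sample text (example.org is a reserved example domain).
my $text = 'Call 814-555-0100 or write to Juliet at juliet@example.org.';

my ($name)  = $text =~ m/(Romeo|Juliet)/;
my ($phone) = $text =~ m/((\d\d\d-)?\d\d\d-\d\d\d\d)/;
my ($email) = $text =~ m/([A-Z0-9._%-]+\@[A-Z0-9.-]+\.[A-Z]{2,4})/i;

# Every word that starts with a vowel and has at least two characters.
my @vowel_words = $text =~ m/\b([aeiou]\w+)/gi;

print "name:  $name\n";
print "phone: $phone\n";
print "email: $email\n";
print "vowel words: @vowel_words\n";
```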
  • 31. Modules <ul><li>Hundreds of modules / packages available through CPAN. </li></ul><ul><li>ActivePerl gives a GUI for installing them in its “Perl Package Manager”. </li></ul>
  • 32. A basic Perl example <ul><li>Counting words. </li></ul><ul><ul><li>counter1.pl </li></ul></ul>
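The workshop script counter1.pl is not reproduced in this transcript. As a rough idea of what a word counter can look like, here is a minimal sketch using a hash; the count_words name, the sample sentence, and the decision to lowercase and strip punctuation are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Count how often each word occurs in a string.
sub count_words {
    my ($text) = @_;
    my %count;
    foreach my $word (split(" ", lc($text))) {
        $word =~ s/[^a-z']//g;      # drop punctuation and digits
        next if $word eq "";
        $count{$word}++;
    }
    return %count;
}

my %count = count_words("The health plan is a plan for health. Health matters!");

# Most frequent first; ties in alphabetical order.
foreach my $word (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$word\t$count{$word}\n";
}
```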
  • 33. Grabbing from the web <ul><li>The basic idea is simply to have Perl act as an “agent”, in the way a browser like Explorer or Firefox does -- requesting and interpreting webpages. </li></ul><ul><li>There are a few basic modules that can do this. </li></ul>
  • 34. LWP::Simple <ul><li>lwpsimpleget.pl </li></ul>
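The script lwpsimpleget.pl is not included in this transcript. A minimal sketch of the same idea, assuming the LWP::Simple module is installed: its get function takes a URL and returns the page content, or undef on failure. The page_title helper and the choice to take the URL from the command line are illustrative additions.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Pull the text of the <title> element out of a page, if there is one.
sub page_title {
    my ($html) = @_;
    return $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is ? $1 : undef;
}

# Run as: perl thisscript.pl URL
if (@ARGV) {
    my $url  = $ARGV[0];
    my $html = get($url);                        # undef if the request failed
    die "Couldn't fetch $url\n" unless defined $html;

    my $title = page_title($html);
    print length($html), " characters downloaded\n";
    print defined $title ? "Title: $title\n" : "(no title found)\n";
}
```

Nothing is requested unless a URL is supplied, so the script can be checked safely before pointing it at a real site.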
  • 35. LWP::UserAgent <ul><li>More elaborate than LWP::Simple. </li></ul><ul><li>I’m going to skip that one today, but it’s covered in detail in the main books </li></ul><ul><ul><li>Perl & LWP </li></ul></ul><ul><ul><li>Spidering Hacks </li></ul></ul><ul><li>Pretty much all of the functionality has been wrapped more intuitively into ... </li></ul>
  • 36. WWW::Mechanize <ul><li>mechanizeget.pl </li></ul>
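The script mechanizeget.pl is likewise not part of this transcript. The sketch below assumes WWW::Mechanize is installed and shows one way to fetch a page and list the links on it. The agent name "WorkshopExample/0.1" and the helper names new_agent and list_links are hypothetical.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Build an agent that dies with a message if a request fails.
sub new_agent {
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->agent("WorkshopExample/0.1");   # hypothetical identifying string
    return $mech;
}

# Fetch a page and return the absolute URL of every link on it.
sub list_links {
    my ($mech, $url) = @_;
    $mech->get($url);
    return map { $_->url_abs->as_string } $mech->links;
}

# Run as: perl thisscript.pl URL
if (@ARGV) {
    my $mech = new_agent();
    print "$_\n" foreach list_links($mech, $ARGV[0]);
}
```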
  • 37. Scraping <ul><li>At its base, this is just extracting information from the page(s) you download. </li></ul><ul><li>Simple example: </li></ul><ul><ul><li>freshair.pl </li></ul></ul>
  • 38. Your agent can interact ... <ul><li>For example, what if the webpage involves a form ... </li></ul><ul><li>Example </li></ul><ul><ul><li>abstracts.pl </li></ul></ul><ul><li>You can authenticate with username and password, run through proxy servers, and so on. </li></ul>
  • 39. Spiders <ul><li>Type 1 Requester </li></ul><ul><ul><li>Requests a few items with known urls from a website. </li></ul></ul><ul><li>Type 2 Requester </li></ul><ul><ul><li>Requests a few items, then requests (some set of) pages to which those items link. </li></ul></ul><ul><li>Type 3 Requester </li></ul><ul><ul><li>Starts at a given url, and then requests everything linked, everything linked by that, etc. at the same host server. The idea here is usually to download an entire website. </li></ul></ul><ul><li>Type 4 Requester </li></ul><ul><ul><li>Starts at a given url, requests everything linked anywhere, everything linked by that, etc. until it, perhaps, visits the entire web. </li></ul></ul><ul><li>YOU – I am talking to YOU – in all likelihood have no business writing Type 3 or Type 4 spiders. These can easily go seriously awry causing mayhem of many sorts. Write only spiders with known finite scope. </li></ul>
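One way to keep a spider's scope finite is to build the limits into the loop itself. The following is a sketch of a Type 2 requester under several assumptions: the crawl subroutine and its parameter names are invented, links are found with a simple regular expression that only sees absolute, double-quoted href values, and the page fetcher is passed in as a code reference so the logic can be tried on fake pages without touching the network.

```perl
use strict;
use warnings;

# Fetch the start pages, then pages they link to, within hard limits:
#   prefix    - only follow links that begin with this string
#   max_depth - how many link-hops away from the start pages to go
#   max_pages - stop after this many pages, no matter what
#   delay     - seconds to pause between requests
sub crawl {
    my (%arg) = @_;
    my @queue = map { [$_, 0] } @{ $arg{start} };
    my (%seen, %pages);

    while (@queue and keys(%pages) < $arg{max_pages}) {
        my $item = shift @queue;
        my ($url, $depth) = @$item;
        next if $seen{$url}++;               # never request a page twice

        my $html = $arg{fetch}->($url);
        next unless defined $html;
        $pages{$url} = $html;

        if ($depth < $arg{max_depth}) {
            while ($html =~ m/href="([^"]+)"/g) {
                my $link = $1;
                push @queue, [$link, $depth + 1]
                    if index($link, $arg{prefix}) == 0;
            }
        }
        sleep $arg{delay} if $arg{delay};    # be gentle with the server
    }
    return %pages;
}

# Try it on a fake three-page site (example.test is not a real host).
my %site = (
    "http://example.test/a" => '<a href="http://example.test/b">b</a> <a href="http://other.test/x">x</a>',
    "http://example.test/b" => '<a href="http://example.test/c">c</a>',
    "http://example.test/c" => 'the end',
);
my %got = crawl(
    start     => ["http://example.test/a"],
    prefix    => "http://example.test/",
    max_depth => 1,
    max_pages => 10,
    delay     => 0,
    fetch     => sub { return $site{ $_[0] } },
);
print "$_\n" foreach sort keys %got;
```

For real use, the fetch argument could be a small subroutine wrapping LWP::Simple's get, with a delay of at least a second or two.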
  • 40. Back to the Luxembourg Miner <ul><li>Commune-level election results from Luxembourg. </li></ul><ul><ul><li>luxembourg.pl </li></ul></ul>
  • 41. More on Scraping <ul><li>All of the examples so far scrape / parse using regular expressions. </li></ul><ul><li>More structured data like HTML is often better (or only) addressed with more specialized tools: </li></ul><ul><ul><li>HTML::TokeParser </li></ul></ul><ul><ul><li>HTML::TreeBuilder </li></ul></ul><ul><li>There are modules for scraping from XML, spreadsheets, databases, Word docs, PDFs. </li></ul>
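As an illustration of the more structured approach, here is a sketch using HTML::TokeParser (assuming it is installed; it ships with the HTML-Parser distribution). It walks through the anchor tags of a page and collects each link's address and text. The link_texts name and the sample HTML are invented for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return a list of [href, link text] pairs found in a chunk of HTML.
sub link_texts {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't parse the HTML";
    my @links;
    while (my $tag = $parser->get_tag("a")) {
        my $href = $tag->[1]{href};          # attributes are in a hash ref
        next unless defined $href;           # skip anchors without an href
        my $text = $parser->get_trimmed_text("/a");
        push @links, [$href, $text];
    }
    return @links;
}

my $sample = '<p><a href="/a.html">First page</a> and '
           . '<a name="x">an anchor</a> and <a href="/b.html">Second</a></p>';

foreach my $link (link_texts($sample)) {
    print "$link->[0]\t$link->[1]\n";
}
```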
