SlideShare a Scribd company logo
Introduction Example Regex Other Methods PDFs

BeautifulSoup: Web Scraping with Python
Andrew Peterson

Apr 9, 2013

files available at:
https://github.com/aristotle-tek/BeautifulSoup_pres

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Roadmap

Uses: data types, examples...
Getting Started
downloading files with wget
BeautifulSoup: in depth example - election results table
Additional commands, approaches
PDFminer
(time permitting) additional examples

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Etiquette/ Ethics

Similar rules of etiquette apply as Pablo mentioned:
Limit requests, protect privacy, play nice...

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Data/Page formats on the web

HTML, HTML5 (<!DOCTYPE html>)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Data/Page formats on the web

HTML, HTML5 (<!DOCTYPE html>)
data formats: XML, JSON
PDF
APIs
other languages of the web: css, java, php, asp.net...
(don’t forget existing datasets)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

BeautifulSoup

General purpose, robust, works with broken tags
Parses html and xml, including fixing asymmetric tags, etc.
Returns unicode text strings
Alternatives: lxml (also parses html), Scrapey
Faster alternatives: ElementTree, SGMLParser (custom)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Installation

pip install beautifulsoup4 or
easy_install beautifulsoup4
See: http://www.crummy.com/software/BeautifulSoup/
On installing libraries:
http://docs.python.org/2/install/

BeautifulSoup
Introduction Example Regex Other Methods PDFs

HTML Table basics

<table> Defines a table
<th> Defines a header cell in a table
<tr> Defines a row in a table
<td> Defines a cell in a table

BeautifulSoup
Introduction Example Regex Other Methods PDFs

HTML Tables

BeautifulSoup
Introduction Example Regex Other Methods PDFs

HTML Tables

<h4>Simple
<table>
<tr>
<td>[r1,
<td>[r1,
</tr>
<tr>
<td>[r2,
<td>[r2,
</tr>
</table>

table:</h4>

c1] </td>
c2] </td>

c1]</td>
c2]</td>

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Example: Election data from html table

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Example: Election data from html table

election results spread across hundreds of pages
want to quickly put in useable format (e.g. csv)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Download relevant pages

website might change at any moment
ability to replicate research
limits page requests

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Download relevant pages

I use wget (GNU), which can be called from within python
alternatively cURL may be better for macs, or scrapy

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Download relevant pages

wget: note the --no-parent option!

os.system("wget --convert-links --no-clobber 
--wait=4 
--limit-rate=10K 
-r --no-parent http://www.necliberia.org/results2011/results

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Step one: view page source

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Outline of Our Approach

1
2

identify the county and precinct number
get the table:
identify the correct table
put the rows into a list
for each row, identify cells
use regular expressions to identify the party & lastname

3

write a row to the csv file

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Open a page

soup = BeautifulSoup(html_doc)
Print all: print(soup.prettify())
Print text: print(soup.get_text())

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Navigating the Page Structure

some sites use div, others put everything in tables.

BeautifulSoup
Introduction Example Regex Other Methods PDFs

find all

finds all the Tag and NavigableString objects that match the
criteria you give.
find table rows: find_all("tr")
e.g.:
for link in soup.find_all(’a’):
print(link.get(’href’))

BeautifulSoup
Introduction Example Regex Other Methods PDFs

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Let’s try it out

We’ll run through the code step-by-step

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions

Allow precise and flexible matching of strings
precise: i.e. character-by-character (including spaces, etc)
flexible: specify a set of allowable characters, unknown
quantities
import re

BeautifulSoup
Introduction Example Regex Other Methods PDFs

from xkcd

(Licensed under Creative Commons Attribution-NonCommercial 2.5 License.)
BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: metacharacters

Metacharacters:
|. ^ $ * + ? { } [ ]  | ( )
excape metacharacters with backslash 

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: Character class

brackets [ ] allow matching of any element they contain
[A-Z] matches a capital letter, [0-9] matches a number
[a-z][0-9] matches a lowercase letter followed by a number

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: Repeat

star * matches the previous item 0 or more times
plus + matches the previous item 1 or more times
[A-Za-z]* would match only the first 3 chars of Xpr8r

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: match anything

dot . will match anything but line break characters r n
combined with * or + is very hungry!

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: or, optional

pipe is for ‘or’
‘abc|123’ matches ‘abc’ or ‘123’ but not ‘ab3’
question makes the preceeding item optional: c3?[a-z]+
would match c3po and also cpu

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: in reverse

parser starts from beginning of string
can tell it to start from the end with $

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions

d, w and s
D, W and S NOT digit (use outside char class)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions

Now let’s see some examples and put this to use to get the
party.

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Basic functions: Getting headers, titles, body

soup.head
soup.title
soup.body

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Basic functions

soup.b
id: soup.find_all(id="link2")
eliminate from the tree: decompose()

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Other Methods: Navigating the Parse Tree

With parent you move up the parse tree. With contents
you move down the tree.
contents is an ordered list of the Tag and NavigableString
objects contained within a page element.
nextSibling and previousSibling: skip to the next or
previous thing on the same level of the parse tree

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Data output

Create simple csv files: import csv
many other possible methods: e.g. use within a pandas
DataFrame (cf Wes McKinney)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Putting it all together

Loop over files
for vote total rows, make the party empty
print each row with the county and precinct number as
columns

BeautifulSoup
Introduction Example Regex Other Methods PDFs

PDFs

Can extract text, looping over 100s or 1,000s of pdfs.
not based on character recognition (OCR)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

pdfminer

There are other packages, but pdfminer is focused more
directly on scraping (rather than creating) pdfs.
Can be executed in a single command, or step-by-step

BeautifulSoup
Introduction Example Regex Other Methods PDFs

pdfminer

BeautifulSoup
Introduction Example Regex Other Methods PDFs

PDFs

We’ll look at just using it within python in a single command,
outputting to a .txt file.
Sample pdfs from the National Security Archive Iraq War:
http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB418/

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Performing an action over all files

Often useful to do something over all files in a folder.
One way to do this is with glob:
import glob
for filename in glob.glob(’/filepath/*.pdf’):
print filename
see also an example file with pdfminer

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Additional Examples

(time permitting): newspapers, output to pandas...

BeautifulSoup

More Related Content

What's hot

JavaScript: Variables and Functions
JavaScript: Variables and FunctionsJavaScript: Variables and Functions
JavaScript: Variables and Functions
Jussi Pohjolainen
 
A Basic Django Introduction
A Basic Django IntroductionA Basic Django Introduction
A Basic Django Introduction
Ganga Ram
 
1 03 - CSS Introduction
1 03 - CSS Introduction1 03 - CSS Introduction
1 03 - CSS Introduction
apnwebdev
 
Introduction to django
Introduction to djangoIntroduction to django
Introduction to django
Ilian Iliev
 

What's hot (20)

Django Interview Questions and Answers
Django Interview Questions and AnswersDjango Interview Questions and Answers
Django Interview Questions and Answers
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to Django
 
Json
JsonJson
Json
 
JavaScript: Variables and Functions
JavaScript: Variables and FunctionsJavaScript: Variables and Functions
JavaScript: Variables and Functions
 
Django for Beginners
Django for BeginnersDjango for Beginners
Django for Beginners
 
Python/Flask Presentation
Python/Flask PresentationPython/Flask Presentation
Python/Flask Presentation
 
A Basic Django Introduction
A Basic Django IntroductionA Basic Django Introduction
A Basic Django Introduction
 
Django Introduction & Tutorial
Django Introduction & TutorialDjango Introduction & Tutorial
Django Introduction & Tutorial
 
An introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable serviceAn introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable service
 
1 03 - CSS Introduction
1 03 - CSS Introduction1 03 - CSS Introduction
1 03 - CSS Introduction
 
JSON: The Basics
JSON: The BasicsJSON: The Basics
JSON: The Basics
 
Introduction to django
Introduction to djangoIntroduction to django
Introduction to django
 
Python Flask Tutorial For Beginners | Flask Web Development Tutorial | Python...
Python Flask Tutorial For Beginners | Flask Web Development Tutorial | Python...Python Flask Tutorial For Beginners | Flask Web Development Tutorial | Python...
Python Flask Tutorial For Beginners | Flask Web Development Tutorial | Python...
 
Web development with Python
Web development with PythonWeb development with Python
Web development with Python
 
Introduction To Django
Introduction To DjangoIntroduction To Django
Introduction To Django
 
Flask Introduction - Python Meetup
Flask Introduction - Python MeetupFlask Introduction - Python Meetup
Flask Introduction - Python Meetup
 
Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Python Functions Tutorial | Working With Functions In Python | Python Trainin...Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Python Functions Tutorial | Working With Functions In Python | Python Trainin...
 
Php
PhpPhp
Php
 
Control Structures In Php 2
Control Structures In Php 2Control Structures In Php 2
Control Structures In Php 2
 
Asynchronous JavaScript & XML (AJAX)
Asynchronous JavaScript & XML (AJAX)Asynchronous JavaScript & XML (AJAX)
Asynchronous JavaScript & XML (AJAX)
 

Viewers also liked

Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful Soup
Jim Chang
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
Sammy Fung
 
CITG Legal Knowledge & Terminology Certificate
CITG Legal Knowledge & Terminology CertificateCITG Legal Knowledge & Terminology Certificate
CITG Legal Knowledge & Terminology Certificate
Terry Bu
 

Viewers also liked (20)

Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4
 
Web Scraping is BS
Web Scraping is BSWeb Scraping is BS
Web Scraping is BS
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 
Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful Soup
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with Python
 
Web Scraping Technologies
Web Scraping TechnologiesWeb Scraping Technologies
Web Scraping Technologies
 
Developing effective data scientists
Developing effective data scientistsDeveloping effective data scientists
Developing effective data scientists
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010
 
Bot or Not
Bot or NotBot or Not
Bot or Not
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Web Scraping in Python with Scrapy
Web Scraping in Python with ScrapyWeb Scraping in Python with Scrapy
Web Scraping in Python with Scrapy
 
Caring for Sharring
 Caring for Sharring  Caring for Sharring
Caring for Sharring
 
CITG Legal Knowledge & Terminology Certificate
CITG Legal Knowledge & Terminology CertificateCITG Legal Knowledge & Terminology Certificate
CITG Legal Knowledge & Terminology Certificate
 
TJOLI
TJOLITJOLI
TJOLI
 
Sincere Speech
Sincere SpeechSincere Speech
Sincere Speech
 
Opjo
OpjoOpjo
Opjo
 

Similar to Beautiful soup

Introducing Modern Perl
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern Perl
Dave Cross
 
PowerShell Core Skills (TechMentor Fall 2011)
PowerShell Core Skills (TechMentor Fall 2011)PowerShell Core Skills (TechMentor Fall 2011)
PowerShell Core Skills (TechMentor Fall 2011)
Concentrated Technology
 

Similar to Beautiful soup (20)

Introducing Modern Perl
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern Perl
 
PDF Generation in Rails with Prawn and Prawn-to: John McCaffrey
PDF Generation in Rails with Prawn and Prawn-to: John McCaffreyPDF Generation in Rails with Prawn and Prawn-to: John McCaffrey
PDF Generation in Rails with Prawn and Prawn-to: John McCaffrey
 
NoSQL, Hadoop, Cascading June 2010
NoSQL, Hadoop, Cascading June 2010NoSQL, Hadoop, Cascading June 2010
NoSQL, Hadoop, Cascading June 2010
 
Hadoop and Cascading At AJUG July 2009
Hadoop and Cascading At AJUG July 2009Hadoop and Cascading At AJUG July 2009
Hadoop and Cascading At AJUG July 2009
 
PARADIGM IT.pptx
PARADIGM IT.pptxPARADIGM IT.pptx
PARADIGM IT.pptx
 
How Groovy Helps
How Groovy HelpsHow Groovy Helps
How Groovy Helps
 
Perly Parsing with Regexp::Grammars
Perly Parsing with Regexp::GrammarsPerly Parsing with Regexp::Grammars
Perly Parsing with Regexp::Grammars
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
MongoDB Aggregation MongoSF May 2011
MongoDB Aggregation MongoSF May 2011MongoDB Aggregation MongoSF May 2011
MongoDB Aggregation MongoSF May 2011
 
Jiscad viz
Jiscad vizJiscad viz
Jiscad viz
 
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAGetting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
 
Getting started with PostGIS geographic database
Getting started with PostGIS geographic databaseGetting started with PostGIS geographic database
Getting started with PostGIS geographic database
 
Sa
SaSa
Sa
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
 
Perl Teach-In (part 1)
Perl Teach-In (part 1)Perl Teach-In (part 1)
Perl Teach-In (part 1)
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
PowerShell Core Skills (TechMentor Fall 2011)
PowerShell Core Skills (TechMentor Fall 2011)PowerShell Core Skills (TechMentor Fall 2011)
PowerShell Core Skills (TechMentor Fall 2011)
 

More from mustafa sarac

More from mustafa sarac (20)

Uluslararasilasma son
Uluslararasilasma sonUluslararasilasma son
Uluslararasilasma son
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
Latka december digital
Latka december digitalLatka december digital
Latka december digital
 
Axial RC SCX10 AE2 ESC user manual
Axial RC SCX10 AE2 ESC user manualAxial RC SCX10 AE2 ESC user manual
Axial RC SCX10 AE2 ESC user manual
 
Array programming with Numpy
Array programming with NumpyArray programming with Numpy
Array programming with Numpy
 
Math for programmers
Math for programmersMath for programmers
Math for programmers
 
The book of Why
The book of WhyThe book of Why
The book of Why
 
BM sgk meslek kodu
BM sgk meslek koduBM sgk meslek kodu
BM sgk meslek kodu
 
TEGV 2020 Bireysel bagiscilarimiz
TEGV 2020 Bireysel bagiscilarimizTEGV 2020 Bireysel bagiscilarimiz
TEGV 2020 Bireysel bagiscilarimiz
 
How to make and manage a bee hotel?
How to make and manage a bee hotel?How to make and manage a bee hotel?
How to make and manage a bee hotel?
 
Cahit arf makineler dusunebilir mi
Cahit arf makineler dusunebilir miCahit arf makineler dusunebilir mi
Cahit arf makineler dusunebilir mi
 
How did Software Got So Reliable Without Proof?
How did Software Got So Reliable Without Proof?How did Software Got So Reliable Without Proof?
How did Software Got So Reliable Without Proof?
 
Staff Report on Algorithmic Trading in US Capital Markets
Staff Report on Algorithmic Trading in US Capital MarketsStaff Report on Algorithmic Trading in US Capital Markets
Staff Report on Algorithmic Trading in US Capital Markets
 
Yetiskinler icin okuma yazma egitimi
Yetiskinler icin okuma yazma egitimiYetiskinler icin okuma yazma egitimi
Yetiskinler icin okuma yazma egitimi
 
Consumer centric api design v0.4.0
Consumer centric api design v0.4.0Consumer centric api design v0.4.0
Consumer centric api design v0.4.0
 
State of microservices 2020 by tsh
State of microservices 2020 by tshState of microservices 2020 by tsh
State of microservices 2020 by tsh
 
Uber pitch deck 2008
Uber pitch deck 2008Uber pitch deck 2008
Uber pitch deck 2008
 
Wireless solar keyboard k760 quickstart guide
Wireless solar keyboard k760 quickstart guideWireless solar keyboard k760 quickstart guide
Wireless solar keyboard k760 quickstart guide
 
State of Serverless Report 2020
State of Serverless Report 2020State of Serverless Report 2020
State of Serverless Report 2020
 
Dont just roll the dice
Dont just roll the diceDont just roll the dice
Dont just roll the dice
 

Recently uploaded

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 

Beautiful soup

  • 1. Introduction Example Regex Other Methods PDFs BeautifulSoup: Web Scraping with Python Andrew Peterson Apr 9, 2013 files available at: https://github.com/aristotle-tek/BeautifulSoup_pres BeautifulSoup
  • 2. Introduction Example Regex Other Methods PDFs Roadmap Uses: data types, examples... Getting Started downloading files with wget BeautifulSoup: in depth example - election results table Additional commands, approaches PDFminer (time permitting) additional examples BeautifulSoup
  • 3. Introduction Example Regex Other Methods PDFs Etiquette/ Ethics Similar rules of etiquette apply as Pablo mentioned: Limit requests, protect privacy, play nice... BeautifulSoup
  • 4. Introduction Example Regex Other Methods PDFs Data/Page formats on the web HTML, HTML5 (<!DOCTYPE html>) BeautifulSoup
  • 5. Introduction Example Regex Other Methods PDFs Data/Page formats on the web HTML, HTML5 (<!DOCTYPE html>) data formats: XML, JSON PDF APIs other languages of the web: css, java, php, asp.net... (don’t forget existing datasets) BeautifulSoup
  • 6. Introduction Example Regex Other Methods PDFs BeautifulSoup General purpose, robust, works with broken tags Parses html and xml, including fixing asymmetric tags, etc. Returns unicode text strings Alternatives: lxml (also parses html), Scrapey Faster alternatives: ElementTree, SGMLParser (custom) BeautifulSoup
  • 7. Introduction Example Regex Other Methods PDFs Installation pip install beautifulsoup4 or easy_install beautifulsoup4 See: http://www.crummy.com/software/BeautifulSoup/ On installing libraries: http://docs.python.org/2/install/ BeautifulSoup
  • 8. Introduction Example Regex Other Methods PDFs HTML Table basics <table> Defines a table <th> Defines a header cell in a table <tr> Defines a row in a table <td> Defines a cell in a table BeautifulSoup
  • 9. Introduction Example Regex Other Methods PDFs HTML Tables BeautifulSoup
  • 10. Introduction Example Regex Other Methods PDFs HTML Tables <h4>Simple <table> <tr> <td>[r1, <td>[r1, </tr> <tr> <td>[r2, <td>[r2, </tr> </table> table:</h4> c1] </td> c2] </td> c1]</td> c2]</td> BeautifulSoup
  • 11. Introduction Example Regex Other Methods PDFs Example: Election data from html table BeautifulSoup
  • 12. Introduction Example Regex Other Methods PDFs Example: Election data from html table election results spread across hundreds of pages want to quickly put in useable format (e.g. csv) BeautifulSoup
  • 13. Introduction Example Regex Other Methods PDFs Download relevant pages website might change at any moment ability to replicate research limits page requests BeautifulSoup
  • 14. Introduction Example Regex Other Methods PDFs Download relevant pages I use wget (GNU), which can be called from within python alternatively cURL may be better for macs, or scrapy BeautifulSoup
  • 15. Introduction Example Regex Other Methods PDFs Download relevant pages wget: note the --no-parent option! os.system("wget --convert-links --no-clobber --wait=4 --limit-rate=10K -r --no-parent http://www.necliberia.org/results2011/results BeautifulSoup
  • 16. Introduction Example Regex Other Methods PDFs Step one: view page source BeautifulSoup
  • 17. Introduction Example Regex Other Methods PDFs Outline of Our Approach 1 2 identify the county and precinct number get the table: identify the correct table put the rows into a list for each row, identify cells use regular expressions to identify the party & lastname 3 write a row to the csv file BeautifulSoup
  • 18. Introduction Example Regex Other Methods PDFs Open a page soup = BeautifulSoup(html_doc) Print all: print(soup.prettify()) Print text: print(soup.get_text()) BeautifulSoup
  • 19. Introduction Example Regex Other Methods PDFs Navigating the Page Structure some sites use div, others put everything in tables. BeautifulSoup
  • 20. Introduction Example Regex Other Methods PDFs find all finds all the Tag and NavigableString objects that match the criteria you give. find table rows: find_all("tr") e.g.: for link in soup.find_all(’a’): print(link.get(’href’)) BeautifulSoup
  • 21. Introduction Example Regex Other Methods PDFs BeautifulSoup
  • 22. Introduction Example Regex Other Methods PDFs Let’s try it out We’ll run through the code step-by-step BeautifulSoup
  • 23. Introduction Example Regex Other Methods PDFs Regular Expressions Allow precise and flexible matching of strings precise: i.e. character-by-character (including spaces, etc) flexible: specify a set of allowable characters, unknown quantities import re BeautifulSoup
  • 24. Introduction Example Regex Other Methods PDFs from xkcd (Licensed under Creative Commons Attribution-NonCommercial 2.5 License.) BeautifulSoup
  • 25. Introduction Example Regex Other Methods PDFs Regular Expressions: metacharacters Metacharacters: |. ^ $ * + ? { } [ ] | ( ) excape metacharacters with backslash BeautifulSoup
  • 26. Introduction Example Regex Other Methods PDFs Regular Expressions: Character class brackets [ ] allow matching of any element they contain [A-Z] matches a capital letter, [0-9] matches a number [a-z][0-9] matches a lowercase letter followed by a number BeautifulSoup
  • 27. Introduction Example Regex Other Methods PDFs Regular Expressions: Repeat star * matches the previous item 0 or more times plus + matches the previous item 1 or more times [A-Za-z]* would match only the first 3 chars of Xpr8r BeautifulSoup
  • 28. Introduction Example Regex Other Methods PDFs Regular Expressions: match anything dot . will match anything but line break characters r n combined with * or + is very hungry! BeautifulSoup
  • 29. Introduction Example Regex Other Methods PDFs Regular Expressions: or, optional pipe is for ‘or’ ‘abc|123’ matches ‘abc’ or ‘123’ but not ‘ab3’ question makes the preceeding item optional: c3?[a-z]+ would match c3po and also cpu BeautifulSoup
  • 30. Introduction Example Regex Other Methods PDFs Regular Expressions: in reverse parser starts from beginning of string can tell it to start from the end with $ BeautifulSoup
  • 31. Introduction Example Regex Other Methods PDFs Regular Expressions d, w and s D, W and S NOT digit (use outside char class) BeautifulSoup
  • 32. Introduction Example Regex Other Methods PDFs Regular Expressions Now let’s see some examples and put this to use to get the party. BeautifulSoup
  • 33. Introduction Example Regex Other Methods PDFs Basic functions: Getting headers, titles, body soup.head soup.title soup.body BeautifulSoup
  • 34. Introduction Example Regex Other Methods PDFs Basic functions soup.b id: soup.find_all(id="link2") eliminate from the tree: decompose() BeautifulSoup
  • 35. Introduction Example Regex Other Methods PDFs Other Methods: Navigating the Parse Tree With parent you move up the parse tree. With contents you move down the tree. contents is an ordered list of the Tag and NavigableString objects contained within a page element. nextSibling and previousSibling: skip to the next or previous thing on the same level of the parse tree BeautifulSoup
  • 36. Introduction Example Regex Other Methods PDFs Data output Create simple csv files: import csv many other possible methods: e.g. use within a pandas DataFrame (cf Wes McKinney) BeautifulSoup
  • 37. Introduction Example Regex Other Methods PDFs Putting it all together Loop over files for vote total rows, make the party empty print each row with the county and precinct number as columns BeautifulSoup
  • 38. Introduction Example Regex Other Methods PDFs PDFs Can extract text, looping over 100s or 1,000s of pdfs. not based on character recognition (OCR) BeautifulSoup
  • 39. Introduction Example Regex Other Methods PDFs pdfminer There are other packages, but pdfminer is focused more directly on scraping (rather than creating) pdfs. Can be executed in a single command, or step-by-step BeautifulSoup
  • 40. Introduction Example Regex Other Methods PDFs pdfminer BeautifulSoup
  • 41. Introduction Example Regex Other Methods PDFs PDFs We’ll look at just using it within python in a single command, outputting to a .txt file. Sample pdfs from the National Security Archive Iraq War: http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB418/ BeautifulSoup
  • 42. Introduction Example Regex Other Methods PDFs Performing an action over all files Often useful to do something over all files in a folder. One way to do this is with glob: import glob for filename in glob.glob(’/filepath/*.pdf’): print filename see also an example file with pdfminer BeautifulSoup
  • 43. Introduction Example Regex Other Methods PDFs Additional Examples (time permitting): newspapers, output to pandas... BeautifulSoup