Feedparser

Brief overview of the Python feedparser module. Feedparser is a robust tool for parsing syndication feeds of all types, including the various flavors of RSS and Atom.

Published in: Technology

Transcript

  • 1. feedparser: Because RSS is Hairy. Lindsey Smith (@turbodog), http://www.feedparser.org/
  • 2. feedparser: because RSS is hairy
    • RSS formats bundle HTML
    • User input via HTML is hairy
    • There are several syndication formats and versions (RSS, Atom, etc.)
    (Diagram labels: RSS, HTML, Micro-format)
  • 3. feedparser: because RSS is hairy
    • Download and parse just about any feed type, including:
      • Various flavors of Atom and RSS
      • Format extensions (iTunes)
      • Micro-formats (GeoRSS, hCard)
    • Ensures that you can treat all feeds the same way, regardless of format or version
  • 4. feedparser: because RSS is hairy
    • Digests whatever crap you throw at it
      • Sanitizes HTML
      • Date normalization
      • Resolving relative links
      • Feed type, version and encoding detection
      • Bozo detection of non-well-formed feeds without blowing up
  • 5. feedparser: because RSS is hairy
    • Parse URL, local file or string data
    • Conditional requests (ETag, Last-Modified) that honor the 304 Not Modified HTTP return code
    • HTTP basic auth
    • Custom request headers
    • Custom handlers
    • Captures response headers
  • 6. feedparser: the good ol’ days
    • Created circa 2002 by Mark Pilgrim of Dive Into Python fame
    • Powers feedvalidator.org
    • v4.1 released in 2007
      • Open source
      • Well-documented
      • 3000 unit tests
      • Available in popular Linux distros
  • 7. feedparser: the lean years
    • Development slows to a trickle
    • No official releases
    • Atom & RSS continue to evolve
      • iTunes enclosures
    • v4.1 released in 2007
      • Still available in popular Linux distros
  • 8. feedparser 5.0: a new hope
    • Small group of developers start working on feedparser
    • v5.0 released January 2011
      • Supports Python 3
      • Micro-formats
      • CSS & HTML5 sanitization
      • Bug fixes, bug fixes, bug fixes
  • 9.
    • >>> import feedparser
    • >>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
    • >>> d['feed']['title'] # feed data is a dictionary
    • u'Sample Feed'
    • >>> d.feed.title # get values attr-style or dict-style
    • u'Sample Feed'
    • >>> d.channel.title # use RSS or Atom terminology anywhere
    • u'Sample Feed'
    • >>> d.feed.link # resolves relative links
    • u'http://example.org/'
    • >>> d.feed.subtitle # parses escaped HTML
    • u'For documentation <em>only</em>'
  • 10.
    • >>> len(d['entries']) # entries are a list
    • 1
    • >>> d['entries'][0]['title'] # each entry is a dictionary
    • u'First entry title'
    • >>> d.entries[0].title # attr-style works here too
    • u'First entry title'
    • >>> d['items'][0].title # RSS terminology works here too
    • u'First entry title'
    • >>> e = d.entries[0]
    • >>> e.link # easy access to alternate link
    • u'http://example.org/entry/3'
    • >>> e.links[1].rel # full access to all Atom links
    • u'related'
    • >>> e.links[0].href # resolves relative links here too
    • u'http://example.org/entry/3'
  • 11.
    • >>> e.updated_parsed # parses all date formats
    • time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)
    • >>> e.content[0].value # sanitizes dangerous HTML
    • u'<div>Watch out for <em>nasty tricks</em></div>'
    • >>> d.version # reports feed type and version
    • u'atom10'
    • >>> d.encoding # auto-detects character encoding
    • u'utf-8'
    • >>> d.headers.get('Content-type') # full access to all HTTP headers
    • u'application/xml'
    • >>> d.bozo # well-formed?
    • 0
  • 12. feedparser: caveats
    • Fairly slow and CPU intensive
      • FriendFeed rolled their own parser and fell back on feedparser
    • Team is looking at ways to speed it up
  • 13. feedparser: the project details
    • Home page: http://www.feedparser.org
    • Discussion: http://code.google.com/p/feedparser
