Feedparser

Brief overview of the Python feedparser module. Feedparser is a robust tool for parsing RSS feeds of all types.

Published in: Technology

  1. feedparser: Because RSS is Hairy
     http://www.feedparser.org/
     Lindsey Smith, @turbodog
  2. feedparser: because RSS is hairy
     - RSS formats bundle HTML
     - User input via HTML is hairy
     - There are several syndication formats and versions (RSS, Atom, etc.)
     (Diagram labels: RSS, HTML, Micro-format)
  3. feedparser: because RSS is hairy
     - Download and parse just about any feed type, including:
       - Various flavors of Atom and RSS
       - Format extensions (iTunes)
       - Micro-formats (GeoRSS, hCard)
     - Ensures that you can treat all feeds the same way, regardless of format or version
  4. feedparser: because RSS is hairy
     - Digests whatever crap you throw at it
       - Sanitizes HTML
       - Date normalization
       - Resolving relative links
       - Feed type, version and encoding detection
       - Bozo detection of non-well-formed feeds without blowing up
  5. feedparser: because RSS is hairy
     - Parse a URL, local file or string data
     - Handles the 304 Not Modified HTTP return code
     - HTTP basic auth
     - Custom request headers
     - Custom handlers
     - Captures response headers
  6. feedparser: the good ol' days
     - Created circa 2002 by Mark Pilgrim of Dive Into Python fame
     - Powers feedvalidator.org
     - v4.1 released in 2007
       - Open source
       - Well-documented
       - 3000 unit tests
       - Available in popular Linux distros
  7. feedparser: the lean years
     - Development slows to a trickle
     - No official releases
     - Atom & RSS continue to evolve
       - iTunes enclosures
     - v4.1 released in 2007
       - Still available in popular Linux distros
  8. feedparser 5.0: a new hope
     - A small group of developers starts working on feedparser
     - v5.0 released January 2011
       - Supports Python 3
       - Micro-formats
       - CSS & HTML5 sanitization
       - Bug fixes, bug fixes, bug fixes
  9. >>> import feedparser
     >>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
     >>> d['feed']['title']  # feed data is a dictionary
     u'Sample Feed'
     >>> d.feed.title  # get values attr-style or dict-style
     u'Sample Feed'
     >>> d.channel.title  # use RSS or Atom terminology anywhere
     u'Sample Feed'
     >>> d.feed.link  # resolves relative links
     u'http://example.org/'
     >>> d.feed.subtitle  # parses escaped HTML
     u'For documentation <em>only</em>'
  10. >>> len(d['entries'])  # entries are a list
      1
      >>> d['entries'][0]['title']  # each entry is a dictionary
      u'First entry title'
      >>> d.entries[0].title  # attr-style works here too
      u'First entry title'
      >>> d['items'][0].title  # RSS terminology works here too
      u'First entry title'
      >>> e = d.entries[0]
      >>> e.link  # easy access to alternate link
      u'http://example.org/entry/3'
      >>> e.links[1].rel  # full access to all Atom links
      u'related'
      >>> e.links[0].href  # resolves relative links here too
      u'http://example.org/entry/3'
  11. >>> e.updated_parsed  # parses all date formats
      time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)
      >>> e.content[0].value  # sanitizes dangerous HTML
      u'<div>Watch out for <em>nasty tricks</em></div>'
      >>> d.version  # reports feed type and version
      u'atom10'
      >>> d.encoding  # auto-detects character encoding
      u'utf-8'
      >>> d.headers.get('Content-type')  # full access to all HTTP headers
      u'application/xml'
      >>> d.bozo  # well-formed?
      0
  12. feedparser: caveats
      - Fairly slow and CPU intensive
        - FriendFeed rolled their own parser and fell back on feedparser
      - Team is looking at ways to speed it up
  13. feedparser: the project details
      - Home page: http://www.feedparser.org
      - Discussion: http://code.google.com/p/feedparser
